Abstract


The  running  time  and  memory  requirement  of  a  parallel  program  with  dynamic,  lightweight  threads  depends 
heavily  on  the  underlying  thread  scheduler.  In  this  paper,  we  present  a  simple,  asynchronous,  space-efficient 
scheduling  algorithm  for  shared  memory  machines  that  combines  the  low  scheduling  overheads  and  good 
locality  of  work  stealing  with  the  low  space  requirements  of  depth-first  schedulers.  For  a  nested-parallel 
program with depth D and serial space requirement S_1, we show that the expected space requirement is
S_1 + O(K · p · D) on p processors. Here, K is a user-adjustable runtime parameter, which provides a trade-off
between running time and space requirement. Our algorithm achieves good locality and low scheduling
overheads  by  automatically  increasing  the  granularity  of  the  work  scheduled  on  each  processor. 

We  have  implemented  the  new  scheduling  algorithm  in  the  context  of  a  native,  user-level  implementation 
of  Posix  standard  threads  or  Pthreads,  and  evaluated  its  performance  using  a  set  of  C-based  benchmarks 
that  have  dynamic  or  irregular  parallelism.  We  compare  the  performance  of  our  scheduler  with  that  of  two 
previous  schedulers:  the  thread  library’s  original  scheduler  (which  uses  a  FIFO  queue),  and  a  provably 
space-efficient  depth-first  scheduler.  At  a  fine  thread  granularity,  our  scheduler  outperforms  both  these 
previous  schedulers,  but  requires  marginally  more  memory  than  the  depth-first  scheduler. 

We  also  present  simulation  results  on  synthetic  benchmarks  to  compare  our  scheduler  with  space-efficient 
versions  of  both  a  work-stealing  scheduler  and  a  depth-first  scheduler.  The  results  indicate  that  unlike  these 
previous  approaches,  the  new  algorithm  covers  a  range  of  scheduling  granularities  and  space  requirements, 
and  allows  the  user  to  trade  the  space  requirement  of  a  program  with  the  scheduling  granularity. 


1  Introduction 

Many  parallel  programming  languages  allow  the  expression 
of dynamic, lightweight threads. These include data-parallel
languages like HPF [22] or Nesl [5] (where the sequences
of instructions executed over individual data elements are the
“threads”),  dataflow  languages  like  ID  [16],  control-parallel 
languages  with  fork-join  constructs  like  Cilk  [20],  CC++  [13], 
and  Proteus  [29],  languages  with  futures  like  Multilisp  [39], 
and  various  user-level  thread  libraries  [3,  17,  30,  43].  In  the 
lightweight  threads  model,  the  programmer  simply  expresses 
all the parallelism in the program, while the language implementation
performs the task of scheduling the threads onto the
processors  at  runtime.  Thus  the  advantages  of  lightweight, 
user-level  threads  include  the  ease  of  programming,  automatic 
load  balancing,  architecture-independent  code  that  can  adapt 
to  a  varying  number  of  processors,  and  the  flexibility  to  use 
kernel-independent  thread  schedulers. 

Programs  with  irregular  and  dynamic  parallelism  benefit 
most from the use of lightweight threads. Compile-time analysis
of such computations to partition and map the threads onto
processors  is  generally  not  possible.  Therefore,  the  programs 
depend  heavily  on  the  implementation  of  the  runtime  system 
for  good  performance.  In  particular, 

1.  To  allow  the  expression  of  a  large  number  of  threads,  the 
runtime  system  must  provide  fast  thread  operations  such  as 
creation,  deletion  and  synchronization. 

2. The thread scheduler must incur low overheads while dynamically
balancing the load across all the processors.

3. The scheduling algorithm must be space efficient, that is, it
must not create too many simultaneously active threads, or
schedule them in an order that results in high memory allocation.
A smaller memory footprint results in fewer page
and TLB misses. This is particularly important for parallel
programs, since they are typically used to solve large problems,
and are often limited by the amount of memory available
on a parallel machine. Existing commercial thread
systems, however, can lead to poor space and time performance
for multithreaded parallel programs, if the scheduler
is not designed to be space efficient [35].

4. Today’s hardware-coherent shared memory multiprocessors
(SMPs) typically have a large off-chip data cache for each
processor, with a latency significantly lower than the latency
to main memory. Therefore, the thread scheduler must
also schedule threads for good cache locality. The most
common heuristic to obtain good locality for fine-grained
threads on multiprocessors is to schedule threads close in
the computation graph (e.g., a parent thread along with its
child threads) on the same processor, since they typically
share common data [1, 9, 25, 27, 31, 39].

Work stealing is a runtime scheduling mechanism that can
provide a fair combination of the above requirements. Each
processor maintains its own queue of ready threads; a processor
steals a thread from another processor’s ready queue
only when it runs out of ready threads in its own queue. Since
thread creation and scheduling are typically local operations,
they incur low overhead and contention. Further, threads close
together in the computation graph are often scheduled on the
same processor, resulting in good locality. Several systems
have used work stealing to provide high performance [11, 17,
18, 20, 26, 39, 42, 44]. When each processor treats its own
ready queue as a LIFO stack (that is, adds or removes threads
from the top of the stack) and steals from the bottom of another
processor’s stack, the scheduler successfully throttles the excess
parallelism [8, 39, 41, 44]. For fully strict computations,
such a mechanism was proved to require p · S_1 space on p
processors, where S_1 is the serial, depth-first space requirement [9].
A computation with W work (total number of operations)
and D depth (length of the critical path) was shown to
require W/p + O(D) time on p processors [9]. We will henceforth
refer to such schedulers as work-stealing schedulers.
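The LIFO-owner, steal-from-bottom discipline described above can be sketched as follows. This is a minimal, single-threaded illustration in Python; the class and method names are hypothetical, and a real scheduler would need synchronization between the owner and thieves.

```python
from collections import deque


class WorkStealingDeque:
    """Ready queue owned by one processor (illustrative, not thread-safe)."""

    def __init__(self):
        # Left end of the deque is the bottom, right end is the top.
        self._items = deque()

    def push_top(self, thread):
        self._items.append(thread)

    def pop_top(self):
        # The owner takes the most recently pushed thread (LIFO).
        return self._items.pop() if self._items else None

    def steal_bottom(self):
        # A thief takes the oldest thread, which usually sits highest in the dag.
        return self._items.popleft() if self._items else None


q = WorkStealingDeque()
for name in ["root", "child", "grandchild"]:
    q.push_top(name)
stolen = q.steal_bottom()   # taken by another processor
local = q.pop_top()         # taken by the owner
```

Here the thief receives the thread pushed first, while the owner keeps working on the thread pushed last.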

Recent work [6, 34] has resulted in depth-first scheduling
algorithms that require S_1 + O(p · D) space for nested-parallel
computations with depth D. For programs that have
a low depth (a high degree of parallelism), such as all programs
in the class NC [14], the space bound of S_1 + O(p · D)
is asymptotically lower than the work-stealing bound of
p · S_1. Further, the depth-first approach allows a more general
memory allocation model compared to the stack-based allocations
assumed in space-efficient work stealing [6]. The
depth-first approach has been extended to handle computations
with futures [39] or I-structures [16], resulting in similar
space bounds [4]. Experiments showed that an asynchronous,
depth-first scheduler often results in a lower space requirement
in practice, compared to a work-stealing scheduler [34]. However,
since depth-first schedulers use a globally ordered queue,
they do not provide some of the practical advantages enjoyed
by work-stealing schedulers. When the threads expressed by
the user are fine grained, the performance may suffer due to
poor locality and high scheduling contention (i.e., contention
over shared data structures while scheduling) [35]. Therefore,
even if basic thread operations are cheap, the threads have to
be coarsened for depth-first schedulers to provide good performance
in practice.

In this paper, we present a new scheduling algorithm for
implementing multithreaded languages on shared memory machines.
The algorithm, called DFDeques¹, provides a compromise
between previous work-stealing and depth-first schedulers.
Ready threads in DFDeques are organized in multiple
ready queues, which are globally ordered as in depth-first schedulers.
The ready queues are treated as LIFO stacks similar to
previous work-stealing schedulers. A processor steals from
a ready queue chosen randomly from a set of high-priority
queues. For nested-parallel (or fully strict) computations, our
algorithm guarantees an expected space bound of S_1 + O(K · p · D).
Here, K is a user-adjustable runtime parameter called the
memory threshold, which specifies the net amount of memory
a processor may allocate between consecutive steals. Since K
is typically fixed to be a small, constant amount of memory,
the space bound reduces to S_1 + O(D · p), as with depth-first
schedulers. For a simplistic cost model, we show that the expected
running time is O(W/p + D) on p processors².
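To get a feel for how the two space bounds compare, the short sketch below evaluates p · S_1 and S_1 + c · K · p · D for one set of inputs. The constant c hidden by the O-notation is not given in the text, so c = 1 is purely an assumption, as are the sample values; the numbers illustrate the shape of the bounds and are not measurements.

```python
def work_stealing_bound(s1, p):
    """Space bound p * S_1 for space-efficient work stealing."""
    return p * s1


def dfdeques_bound(s1, p, depth, k, c=1):
    """Space bound S_1 + c*K*p*D; c is a hypothetical big-O constant."""
    return s1 + c * k * p * depth


# Hypothetical inputs: 1 MB serial space, 8 processors,
# depth 100, memory threshold of 1000 bytes.
s1, p, depth, k = 1_000_000, 8, 100, 1_000
ws = work_stealing_bound(s1, p)
dfd = dfdeques_bound(s1, p, depth, k)
```

With these assumed values the additive bound is smaller, which is the regime (low depth relative to serial space) that the text highlights.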

We refer to the total number of instructions executed in a
thread as the thread’s granularity. We also (informally) define
scheduling granularity to be the average number of instructions
executed consecutively on a single processor, from
threads close together in the computation graph. Thus, a larger
scheduling  granularity  typically  implies  better  locality  and 

1  DFDeques  stands  for  “depth-first  deques”. 

2 When the scheduler in DFDeques is parallelized, the costs of all scheduling
operations can be accounted for with a more realistic model [33]. Then, in the
expected case, the parallel computation can be executed using S_1 + O(D · p ·
log p) space and O(W/p + D · log p) time (including scheduling overheads).
However, for brevity, we omit a description and analysis of such a parallelized
scheduler.
lower scheduling contention. In the DFDeques scheduler, when
a processor finds its ready queue empty, it steals a thread from
the bottom of another ready queue. This thread is typically
the coarsest thread in the queue, resulting in a larger scheduling
granularity compared to depth-first schedulers. Although
we do not analytically prove this claim, we present experimental
and simulation results to verify it. Adjusting the memory
threshold K in the DFDeques algorithm provides a user-controllable
trade-off between scheduling granularity and
space requirement.

Posix threads or Pthreads have recently become a popular
standard for shared memory parallel programming. We
therefore added the DFDeques scheduling algorithm to a native,
user-level Pthreads library [43]. Despite being one of
the fastest user-level implementations of Pthreads today, the
library’s scheduler does not efficiently support fine-grained,
dynamic threads. In previous work [35], we showed how its
performance can be improved using a space-efficient depth-first
scheduler. In this paper, we compare the space and time
performance of the new DFDeques scheduler with the library’s
original scheduler (which uses a FIFO scheduling queue), and
with our previous implementation of a depth-first scheduler.
To perform the experimental comparison, we used 7 parallel
benchmarks written with a large number of dynamically created
Pthreads. As shown in Figure 1, the new DFDeques
scheduler results in better locality and higher speedups compared
to both the depth-first scheduler and the FIFO scheduler.

Ideally, we would also like to compare our Pthreads-based
implementation of DFDeques with a space-efficient work-stealing
scheduler (e.g., the scheduler used in Cilk [8]). However,
supporting the general Pthreads functionality with an existing
space-efficient work-stealing scheduler [8] would require
significant modifications to both the scheduling algorithm and
the Pthreads implementation³. Therefore, to compare our new
scheduler to this work-stealing scheduler, we instead built a
simple simulator that implements synthetic, fully strict benchmarks.
Our simulation results indicate that by adjusting the
memory threshold, our new scheduler covers a wide range
of space requirements and scheduling granularities. At one
extreme it performs similarly to a depth-first scheduler, with
low space requirement and small scheduling granularity. At
the other extreme, it behaves exactly like the work-stealing
scheduler, with higher space requirement and larger scheduling
granularity.

2  Background  and  Previous  Work 

A parallel computation can be represented by a directed acyclic
graph; we will refer to such a computation graph as a dag in
the remainder of this paper. Each node in the dag represents
a single action in a thread; an action is a unit of work that requires
a single timestep to be executed. Each edge in the dag
represents a dependence between two actions. Figure 2 shows
such an example dag for a simple parallel computation. The
dashed, right-to-left fork edges in the figure represent the fork
of a child thread. The dashed, left-to-right synch edges represent
a join between a parent and child thread, while each solid
vertical continue edge represents a sequential dependence between
a pair of consecutive actions within a single thread. For

3 Even fully strict Pthreads benchmarks cannot be executed using such a
work-stealing scheduler in the existing Solaris Pthreads implementation, because
the Pthreads implementation itself makes extensive use of blocking synchronization
primitives such as Pthread mutexes and condition variables.
Figure 2: An example dag for a parallel computation; the
threads are shown shaded. Each right-to-left edge represents
a fork, and each left-to-right edge represents a synchronization
of a child thread with its parent. Vertical edges represent
sequential dependencies within threads. t_0 is the initial (root)
thread, which forks child threads t_1, t_2, t_3, and t_4 in that order.
Child threads may fork threads themselves; e.g., t_2 forks t_5.

computations  with  dynamic  parallelism,  the  dag  is  revealed 
and  scheduled  onto  the  processors  at  runtime. 

2.1  Scheduling  for  locality 

Detection of data accesses or data sharing patterns among
threads in a dynamic and irregular computation is often beyond
the scope of the compiler. Further, today’s hardware-coherent
SMPs do not allow explicit, software-controlled placement
of data in processor caches; therefore, owner-compute
optimizations for locality that are popular on distributed memory
machines typically do not apply to SMPs. However, in
many parallel programs with fine-grained threads, the threads
close together in the computation’s dag often access the same
data. For example, in a divide-and-conquer computation (such
as quicksort) where a new thread is forked for each recursive
call, a thread shares data with all its descendant threads.
Therefore, many parallel implementations of lightweight
threads use per-processor data structures to store ready
threads [17, 20, 24, 25, 39, 42, 44]. Threads created on a processor
are stored locally and moved only when required to balance
the load. This technique effectively increases scheduling
granularity, and therefore provides good locality [7] and low
scheduling contention.

Another approach for obtaining good locality is to allow
the user to supply hints to the scheduler regarding the data access
patterns of the threads [12, 28, 37, 45]. However, such
hints can be cumbersome for the user to provide in complex
programs, and are often specific to a certain language or library
interface. Therefore, our DFDeques algorithm instead
uses the heuristic of scheduling threads close in the dag on the
same processor to obtain good locality.

2.2  Scheduling  for  space-efficiency 

The  thread  scheduler  plays  a  significant  role  in  controlling  the 
amount  of  active  parallelism  in  a  fine-grained  computation. 
For  example,  consider  a  single-processor  execution  of  the  dag 
in  Figure  2.  If  the  scheduler  uses  a  LIFO  stack  to  store  ready 
threads,  and  a  child  thread  preempts  its  parent  as  soon  as  it 
is  forked,  the  nodes  are  executed  in  a  (left-to-right)  depth-first 
order,  resulting  in  at  most  5  simultaneously  active  threads.  In 
Figure 1: Summary of experimental results with the Solaris Pthreads library. For each scheduling technique, we show the maximum
number of simultaneously active threads (each of which requires a minimum of 8 kB of stack space), the L2 cache miss rates (%), and the
speedups  on  an  8-processor  Enterprise  5000  SMP.  “FIFO”  is  the  original  Pthreads  scheduler,  “ADF”  is  an  asynchronous,  depth-first 
scheduler  [35],  and  “DFD”  is  our  new  DFDeques  scheduler. 


contrast,  if  the  scheduler  uses  a  FIFO  queue,  the  threads  are 
executed in a breadth-first order, resulting in all 16 threads being
simultaneously active. Systems that support fine-grained,
dynamic  parallelism  can  suffer  from  such  a  creation  of  excess 
parallelism. 
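The effect of the ready-queue discipline on the number of active threads can be reproduced with a small single-processor simulation. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: the computation is a full binary fork-join tree rather than the dag of Figure 2, each scheduling step lets a thread fork one child, and a parent terminates as soon as its second child does. The function name max_active_threads is hypothetical.

```python
from collections import deque


class Thread:
    def __init__(self, level, parent=None):
        self.level = level
        self.parent = parent
        self.forked = 0     # children forked so far
        self.finished = 0   # children that have terminated


def max_active_threads(depth, policy):
    """Peak number of created-but-unterminated threads on one processor."""
    ready = deque([Thread(0)])
    active = peak = 1
    while ready:
        t = ready.pop() if policy == "lifo" else ready.popleft()
        if t.level == depth:
            # Leaf: terminate it, then any ancestor whose join is now satisfied.
            while t is not None:
                active -= 1
                t = t.parent
                if t is None:
                    break
                t.finished += 1
                if t.finished < 2:
                    break
            continue
        child = Thread(t.level + 1, t)
        t.forked += 1
        active += 1
        peak = max(peak, active)
        parent_ready = t.forked < 2   # after the second fork the parent suspends
        if policy == "lifo":
            if parent_ready:
                ready.append(t)
            ready.append(child)       # child on top: it preempts its parent
        else:
            ready.append(child)
            if parent_ready:
                ready.append(t)
    return peak


lifo_peak = max_active_threads(4, "lifo")
fifo_peak = max_active_threads(4, "fifo")
```

In this model the LIFO policy keeps only one root-to-leaf path active at a time, so its peak grows with the depth of the tree, whereas the FIFO policy creates threads level by level and reaches a higher peak.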

Initial attempts to control the active parallelism were based
on heuristics [3, 16, 31, 40, 39], which included work-stealing
techniques [31, 39]. Heuristic attempts work well for some
programs, but do not guarantee an upper bound on the space
requirements of a program. More recently, two different techniques
have been shown to be provably space-efficient:
work-stealing schedulers, and depth-first schedulers.

In  addition  to  being  space  efficient  [8,  41],  work  stealing 
can  often  result  in  large  scheduling  granularities,  by  allowing 
idle processors to steal threads higher up in the dag (e.g., see
Figure 3(a)). Several systems use such an approach to obtain
good parallel performance [8, 17, 26, 39, 44].

Depth-first schedulers guarantee an upper bound on the
space requirement of a parallel computation by prioritizing
its threads according to their serial, depth-first execution order [6, 34].
In a recent paper [35], we showed that the performance
of a commercial Pthreads implementation could be
improved for predominantly nested-parallel benchmarks using
a depth-first scheduler. However, depth-first schedulers can result
in high scheduling contention and poor locality when the
threads in the program are very fine grained [34, 35] (see Figure 3).

The  next  section  describes  a  new  scheduling  algorithm  that 
combines  ideas  from  the  above  two  space-efficient  approaches. 


3  The  DFDeques  Scheduling  Algorithm 

We  first  describe  the  programming  model  for  the  multithreaded 
computations  that  are  executed  by  the  DFDeques  scheduling 
algorithm. We then list the data structures used by the scheduler,
followed by a description of the DFDeques scheduling
algorithm. 

3.1  Programming  model 

As with depth-first schedulers, our scheduling algorithm applies
to pure, nested-parallel computations, which can be modeled
by series-parallel dags [6]. Nested-parallel computations
are equivalent to the subset of fully strict computations supported
by Cilk’s space-efficient work-stealing scheduler [8,


Figure  3:  Possible  mappings  of  threads  of  the  dag  in  Figure  2 
onto processors P_0, ..., P_3 by (a) work-stealing schedulers,
and  (b)  depth-first  schedulers.  If,  say,  the  ith  thread  (going 
from left to right) accesses the ith block or element of an array,
then scheduling consecutive threads on the same processor
provides  better  cache  locality  and  lower  scheduling  overheads. 

20].  Nested  parallelism  can  be  used  to  express  a  large  variety 
of  parallel  programs,  including  recursive,  divide-and-conquer 
programs  and  programs  with  nested-parallel  loops.  Our  model 
assumes  binary  forks  and  joins;  the  example  dag  in  Figure  2 
represents  such  a  nested-parallel  computation. 

Although we describe and analyze our algorithm for nested-parallel
computations, in practice it can be extended to execute
programs with other styles of parallelism. For example,
the Pthreads scheduler described in Section 5 supports computations
with arbitrary synchronizations, such as mutexes and
condition  variables.  However,  our  analytical  space  bound  does 
not  apply  to  such  general  computations. 

Figure 4: The serial, depth-first execution order for a nested-parallel
computation. The ith node executed is labelled i in
this dag; the lower the label of a thread’s current node (action),
the higher is its priority in DFDeques.

A thread is active if it has been created but has not yet terminated.
A parent thread waiting to synchronize with a child
thread is said to be suspended. We say an active thread is
ready to be scheduled if it is not suspended, and is not currently
being executed by a processor. Each action in a thread
may allocate an arbitrary amount of space on the thread stack,
or on the shared heap.

Every nested-parallel computation has a natural serial execution
order, which we call its depth-first order. When a child
thread is forked, it is executed before its parent in a depth-first
execution (e.g., see Figure 4). Thus, the depth-first order
is identical to the unique serial execution order for any
stack-based language (such as C), when the thread forks are
replaced by simple function calls. Algorithm DFDeques prioritizes
ready threads according to their serial, depth-first execution
order; an earlier serial execution order translates to a
higher priority.
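The depth-first order can be illustrated by treating each fork as a plain function call. In the sketch below, a thread is a Python list of action labels, and a nested list stands for a forked child thread; this encoding and the labels are assumptions made only for the illustration.

```python
def depth_first_order(thread):
    """Yield action labels in serial depth-first order.

    A nested list represents a forked child thread.
    """
    for action in thread:
        if isinstance(action, list):
            # The child runs to completion before the parent continues,
            # exactly as if the fork were a function call.
            yield from depth_first_order(action)
        else:
            yield action


# Hypothetical computation: root forks a child after a1 and another after a2.
root = ["a1", ["b1", ["c1"], "b2"], "a2", ["d1"], "a3"]
order = list(depth_first_order(root))

# Lower rank means earlier serial execution, i.e., higher priority.
priority = {label: rank for rank, label in enumerate(order, start=1)}
```

Under this encoding the child action c1 receives a lower rank, and therefore a higher priority, than the parent continuation a2.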

3.2  Scheduling  data  structures 

Although the dag for a computation is revealed as the execution
proceeds, dynamically maintaining the relative thread priorities
for nested-parallel computations is straightforward [6]
and inexpensive in practice [34]. In algorithm DFDeques,
the ready threads are stored in double-ended queues or deques [15].
Each of these deques supports popping from and
pushing onto its top, as well as popping from the bottom of the
deque. At any time during the execution, a processor owns at
most one deque, and executes threads from it. A single deque
has at most one owner at any time. However, unlike traditional
work stealing, the number of deques may exceed the number
of processors. All the deques are arranged in a global list R of
deques. The list supports adding of a new deque to the immediate
right of another deque, deletion of a deque, and finding
the mth deque from the left end of R.
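The three operations required of the global list can be sketched with an ordinary Python list. This is only an illustration: the class and method names are hypothetical, the operations take linear time, and no locking is shown, whereas a real implementation would need an efficient, concurrent structure.

```python
from collections import deque


class DequeList:
    """Ordered list of deques; the leftmost holds the highest-priority threads."""

    def __init__(self):
        self._deques = []

    def _index(self, d):
        # Compare by identity: two deques with equal contents are distinct.
        for i, x in enumerate(self._deques):
            if x is d:
                return i
        raise ValueError("deque not in list")

    def insert_right_of(self, existing, new):
        # existing=None places the new deque at the left end.
        pos = 0 if existing is None else self._index(existing) + 1
        self._deques.insert(pos, new)

    def delete(self, d):
        del self._deques[self._index(d)]

    def mth_from_left(self, m):
        # m is 1-based; returns None if fewer than m deques exist.
        return self._deques[m - 1] if 1 <= m <= len(self._deques) else None


r = DequeList()
a, b, c = deque(["t1"]), deque(["t2"]), deque(["t3"])
r.insert_right_of(None, a)
r.insert_right_of(a, c)
r.insert_right_of(a, b)   # order is now a, b, c
```

Returning None for a missing mth deque mirrors the case in which a steal attempt targets a non-existent deque.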

3.3  The  DFDeques  scheduling  algorithm 

The processors execute the code in Figure 5 for algorithm
DFDeques(K); here K is the memory threshold, a user-defined
runtime parameter. Each processor treats its own deque as
a regular LIFO stack, and is assigned a memory quota of K
bytes from which to allocate heap and stack data. This memory
threshold K is equivalent to the per-thread memory quota
in depth-first schedulers [34]; however, in algorithm DFDeques,
the memory quota of K bytes can be used by a processor
to execute multiple threads from one deque. A thread executes
without preemption on a processor until it forks a child
thread,  suspends  waiting  for  a  child  to  terminate,  terminates, 
or  the  processor  runs  out  of  its  memory  quota.  If  a  terminating 
thread  wakes  up  its  previously  suspended  parent,  the  processor 


while (∃ threads)
    if (currS = NULL) currS := steal();
    if (currT = NULL) currT := pop_from_top(currS);
    execute currT until it forks, suspends, terminates,
            or memory quota exhausted:
        case (fork):
            push_to_top(currT, currS);
            currT := newly forked child thread;
        case (suspend):
            currT := NULL;
        case (memory quota exhausted):
            push_to_top(currT, currS);
            currT := NULL;
            currS := NULL;  /* give up stack */
        case (terminate):
            if currT wakes up suspended parent T_p
                currT := T_p;
            else currT := NULL;
    if ((is_empty(currS)) and (currT = NULL))
        currS := NULL;  /* give up and delete stack */
endwhile

procedure steal():
    set memory quota to K;
    while (TRUE)
        m := random number in [1 ... p];
        S := mth deque in R;
        T := pop_from_bot(S);
        if (T ≠ NULL)
            create new deque S' containing T
                and become its owner;
            place S' to immediate right of S in R;
            return S';

Figure 5: Pseudocode for the DFDeques(K) scheduling algorithm
executed by each of the p processors; K is the memory
threshold.  currS  is  the  processor’s  current  deque.  currT  is 
the  current  thread  being  executed;  changing  its  value  denotes 
a  context  switch.  Memory  management  of  the  deques  is  not 
shown  here  for  brevity. 


starts executing the parent next; for nested-parallel computations,
we can show that the processor’s deque must be empty at
this stage [33]. When an idle processor finds its deque empty,
it deletes the deque. When a processor deletes its deque, or
when it gives up ownership of its deque due to exhaustion of
its memory quota, it uses the steal() procedure to obtain
a new deque. Every invocation of steal() resets the processor’s
memory quota to K bytes. We call an iteration of the
loop in the steal() procedure a steal attempt.

A processor executes a steal attempt by picking a random
number m between 1 and p, where p is the number of processors.
It then tries to steal the bottom thread from the mth
deque (starting from the left end) in R. A steal attempt
may fail (that is, pop_from_bot() returns NULL) if two or
more processors target the same deque (see Section 4.1), or
if the deque is empty or non-existent. If the steal attempt is
successful (pop_from_bot() returns a thread), the stealing
processor creates a new deque for itself, places it to the immediate
right of the target deque, and starts executing the stolen
thread. Otherwise, it repeats the steal attempt. When a processor
steals the last thread from a deque not currently associated
with (owned by) any processor, it deletes the deque.

Figure 6: The list R of deques maintained in the system by algorithm
DFDeques. Each deque may have one (or no) owner
processor. The dotted line traces the decreasing order of priorities
of the threads in the system; thus t_a in this figure has the
highest priority, while t_b has the lowest priority.
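A steal attempt as described above can be sketched in a few lines. The code below is a single-threaded illustration with hypothetical names (steal_attempt, steal, rng): the global list is a plain Python list of collections.deque objects whose left end is the bottom, and contention between processors, the reset of the memory quota, and the deletion of emptied, unowned deques are all left out.

```python
import random
from collections import deque


def steal_attempt(deques, p, rng):
    """Try once to steal; return the thief's new deque, or None on failure."""
    m = rng.randint(1, p)             # uniform in 1..p, inclusive
    if m > len(deques):
        return None                   # the mth deque does not exist
    target = deques[m - 1]
    if not target:
        return None                   # the mth deque is empty
    thread = target.popleft()         # bottom thread = lowest priority in target
    new = deque([thread])
    deques.insert(m, new)             # immediately to the right of the target
    return new


def steal(deques, p, rng):
    """Repeat steal attempts until one succeeds (loops forever if none can)."""
    while True:
        got = steal_attempt(deques, p, rng)
        if got is not None:
            return got


# One deque holding two threads; the lower-priority one is at the bottom (left).
deques = [deque(["t_low", "t_high"])]
mine = steal(deques, p=1, rng=random.Random(0))
```

Placing the new deque directly to the right of the target keeps the stolen, lower-priority thread to the right of the higher-priority threads that remain in the target.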

If a thread contains an action that performs a memory allocation
of m units such that m > K (where K is the memory
threshold), then ⌈m/K⌉ dummy threads must be forked
in a binary tree of depth O(log(m/K)) before the allocation⁴.
We  do  not  show  this  extension  in  Figure  5  for  brevity.  Each 
dummy thread executes a no-op. However, processors must
give up their deques and perform a steal every time they execute
a dummy thread. Once all the dummy threads have been
executed, a processor may proceed with the memory allocation.
This transformation takes place at runtime. The addition
of dummy threads effectively delays large allocations of space,
so that higher priority threads may be scheduled instead. In
practice, K is typically set to a few thousand bytes, so that the
runtime overhead due to the dummy threads is negligible (e.g.,
see Section 5).
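The cost of the dummy-thread transformation for a single large allocation can be estimated with a small helper. The sketch below rests on two assumptions that the text does not spell out: the number of dummy threads is m/K rounded up, and the dummy threads are the leaves of a balanced binary fork tree. The function name dummy_thread_plan is hypothetical.

```python
def dummy_thread_plan(m, k):
    """Return (dummy thread count, fork tree depth) for allocating m units.

    Assumes the count is m/k rounded up and the dummy threads are the
    leaves of a balanced binary fork tree.
    """
    if m <= k:
        return 0, 0                    # small allocations need no dummy threads
    count = -(-m // k)                 # integer ceiling of m / k
    depth = (count - 1).bit_length()   # ceiling of log2(count) for count >= 1
    return count, depth


# Hypothetical request: 10,000 bytes with a threshold of 1,000 bytes.
count, depth = dummy_thread_plan(10_000, 1_000)
```

For this request the helper reports ten dummy threads in a tree of depth four, so the delay before the allocation grows only logarithmically with m/K.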

We now prove a lemma regarding the order of threads in
R maintained by algorithm DFDeques; this order is shown
pictorially in Figure 6.

Lemma  3.1  Algorithm  DFDeques  maintains  the  following 
ordering  of  threads  in  the  system. 

1. Threads in each deque are in decreasing order of priorities
from top to bottom.

2. A thread currently executing on a processor has higher priority
than all other threads on the processor's deque.

3. The threads in any given deque have higher priorities than
threads in all the deques to its right in R.

Proof.  By  induction  on  the  timesteps.  The  base  case  is  the 
start  of  the  execution,  when  the  root  thread  is  the  only  thread 
in  the  system.  Let  the  three  properties  be  true  at  the  start  of 
any  subsequent  timestep.  Any  of  the  following  events  may 
take  place  on  each  processor  during  the  timestep;  we  will  show 
that  the  properties  continue  to  hold  at  the  end  of  the  timestep. 

When  a  thread  forks  a  child  thread,  the  parent  is  added  to 
the  top  of  the  processor’s  deque,  and  the  child  starts  execution. 
Since the parent has a higher priority than all other threads in
the  processor’s  deque  (by  induction),  and  since  the  child  thread 
has  a  higher  priority  (earlier  depth-first  execution  order)  than 
its  parent,  properties  (1)  and  (2)  continue  to  hold.  Further, 

4 This transformation differs slightly from depth-first schedulers [6, 34],
which  allow  dummy  threads  to  be  forked  in  a  multi-way  fork  of  constant  depth. 


since  the  child  now  has  the  priority  immediately  higher  than 
its  parent,  property  (3)  holds. 

When a thread T terminates, the processor checks if T has reactivated a
suspended parent thread $T_p$. In this case, it starts executing $T_p$.
Since the computation is nested parallel, the processor’s deque must now be
empty (since the parent $T_p$ must have been stolen at some earlier point
and then suspended). Therefore, all three properties continue to hold. If T
did not wake up its parent, the processor picks the next thread from the top
of its deque. If the deque is empty, it deletes the deque and performs a
steal. Therefore all three properties continue to hold in these cases too.

When a thread suspends or is preempted due to exhaustion of the processor’s
memory quota, it is put back on the top of its deque, and the deque retains
its position in R. Thus all three properties continue to hold.

When  a  processor  steals  the  bottom  thread  from  another 
deque,  it  adds  the  new  deque  to  the  right  of  the  target  deque. 
Since  the  stolen  thread  had  the  lowest  priority  in  the  target 
deque,  the  properties  continue  to  hold.  Similarly,  removal  of  a 
thread  from  the  target  deque  does  not  affect  the  validity  of  the 
three  properties  for  the  target  deque.  A  thread  may  be  stolen 
from  a  processor’s  deque  while  one  of  the  above  events  takes 
place  on  the  processor  itself;  this  does  not  affect  the  validity 
of  our  argument. 

Finally, deletion of one or more deques from R does not
affect the three properties. ■
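The ordering in Lemma 3.1 can be illustrated with a small checker. This is only a sketch under stated assumptions, not part of the scheduler: a snapshot of the deques is modelled as a list of lists (left to right, each deque top to bottom), and each thread is represented by a hypothetical integer rank in the serial depth-first order, so that a smaller rank means a higher priority. The function name `ordering_holds` is made up for illustration; it checks properties (1) and (3) only, since property (2) involves the currently executing thread, which the snapshot does not model.

```python
from typing import List


def ordering_holds(deques: List[List[int]]) -> bool:
    """Check properties (1) and (3) of Lemma 3.1 on a snapshot of the deques.

    `deques` lists the deques from left to right; each deque lists its
    threads from top to bottom. A thread is an integer rank in the serial
    depth-first order (assumption: smaller rank = higher priority).
    """
    previous_lowest = None  # rank of the lowest-priority thread seen so far
    for deque in deques:
        # Property (1): priorities decrease (ranks increase) top to bottom.
        if any(a >= b for a, b in zip(deque, deque[1:])):
            return False
        if not deque:
            continue  # an empty deque constrains nothing
        # Property (3): every thread here ranks after all threads to the left.
        if previous_lowest is not None and deque[0] <= previous_lowest:
            return False
        previous_lowest = deque[-1]
    return True
```

For example, `ordering_holds([[1, 2], [3], [5, 8]])` holds, while `[[1, 4], [3]]` violates property (3) because thread 3 sits to the right of the lower-priority thread 4.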

Work  stealing  as  a  special  case  of  algorithm  DFDeques. 

Consider the case when we set the memory threshold K = ∞. Then, for
nested-parallel computations, algorithm DFDeques(∞) produces a schedule
identical to the one produced by the provably-efficient work-stealing
scheduler “WS” [9]. The processors in DFDeques(∞) never give up a deque due
to exhaustion of their memory quota, and therefore, as with the work
stealer, there are never more than p deques in the system. Further, in both
algorithms, when a processor’s deque becomes empty, it picks another
processor uniformly at random, and steals the bottommost thread from that
processor’s deque. Similarly, for nested-parallel computations, the rule for
waking up a suspended parent in DFDeques(∞) is equivalent to the
corresponding rule in WS⁵. Of course, the schedules are identical assuming
the same cost model for both algorithms; the model could be either the
atomic-access model used to analyze WS [9], or our cost model from
Section 4.1.

4 Analysis of Time and Space Bounds Using Algorithm DFDeques

We  now  prove  the  space  and  time  bounds  for  nested-parallel 
computations. 

4.1  Cost  model 

We define the total number of unit actions in a parallel computation (or
the number of nodes in its dag) as its work W. Further, let D be the depth
of the computation, that is, the length of the longest path in its dag. For
example, the computation represented in Figure 4 has work W = 11 and depth
D = 6. We assume that an allocation of m bytes of memory (for any m > 0)
has a depth of O(log m) units⁶.

5 In WS, the reawakened parent is added to the current processor’s deque
(which is empty); for nested parallel computations, the child must terminate
at this point, and therefore, the next thread executed by the processor is
the parent thread.

For this analysis, we assume that timesteps (clock cycles) are synchronized
across all the processors. If multiple processors target a non-empty deque
in a single timestep, we assume that one of them succeeds in the steal,
while all the others fail in that timestep. If the deque targeted by one or
more steals is empty, all of those steals fail in a single timestep. When a
steal fails, the processor attempts another steal in the next timestep. When
a steal succeeds, the processor inserts the newly created deque into R and
executes the first action from the stolen thread in the same timestep. At
the end of a timestep, if a processor’s current thread terminates or
suspends, and it finds its deque to be empty, it immediately deletes its
deque in that timestep. Similarly, when a processor steals the last thread
from a deque not currently associated with any processor, it deletes the
deque in that timestep. Thus, at the start of a timestep, if a deque is
empty, it must be owned by a processor that is busy executing a thread.

Our cost model is somewhat simplistic, because it ignores the cost of
maintaining the ordered set of deques R. If we parallelize the scheduling
tasks of inserting and deleting deques in R (by performing them lazily), we
can account for all their overheads in the time bound. We can then show that
in the expected case, the computation can be executed in
$O(W/p + D \cdot \log p)$ time and $S_1 + O(p \log p \cdot D)$ space on p
processors, including the scheduling overheads [33]. In practice, the
insertions and deletions of deques from R can be either serialized and
protected by a lock (for small p), or performed lazily in parallel (for
large p).

4.2  Space  bound 

We  now  analyze  the  space  bound  for  a  parallel  computation 
executed  by  algorithm  DFDeques.  The  analysis  uses  several 
ideas  from  previous  work  [2,  6,  34]. 

Let  G  be  the  dag  that  represents  the  parallel  computation 
being  executed.  Depending  on  the  resulting  parallel  schedule, 
we  classify  its  nodes  (actions)  into  one  of  two  types:  heavy  and 
light.  Every  time  a  processor  performs  a  steal,  the  first  node 
it  executes  from  the  stolen  thread  is  called  a  heavy  action.  All 
remaining  nodes  in  G  are  labelled  as  light. 

We first assume that every node allocates at most K space; we will relax
this assumption in the end. Recall that a processor may allocate at most K
space between consecutive steals; thus, it may allocate at most K space for
every heavy node it executes. Therefore, we can attribute all the memory
allocated by light nodes to the last heavy node that precedes them. This
results in a conservative view of the total space allocation.

Let $s_p = V_1, \ldots, V_\tau$ be the parallel schedule of the dag
generated by algorithm DFDeques(K). Here $V_i$ is the set of nodes that are
executed at timestep i. Let $s_1$ be the serial, depth-first schedule or the
1DF-schedule for the same dag; e.g., the nodes in Figure 4 are numbered
according to their order of execution in a 1DF-schedule.

6 This is a reasonable assumption in systems with binary forks that zero
out the memory as soon as it is allocated. The zeroing then requires a
minimum depth of Θ(log m); it can be performed in parallel by forking a tree
of height Θ(log m).

We now view an intermediate snapshot of the parallel schedule $s_p$. At any
timestep $1 \le j \le \tau$ during the execution of $s_p$, all the nodes
executed so far form a prefix of $s_p$. This prefix of $s_p$ is defined as
$\sigma_p = \bigcup_{i=1}^{j} V_i$. Let $\sigma_1$ be the longest prefix of
$s_1$ containing only nodes in $\sigma_p$, that is,
$\sigma_1 \subseteq \sigma_p$. Then the prefix $\sigma_1$ is called the
corresponding serial prefix of $\sigma_p$. The nodes in the set
$\sigma_p - \sigma_1$ are called premature nodes, since they have been
executed out of order with respect to the 1DF-schedule $s_1$. All other
nodes in $\sigma_p$, that is, the set $\sigma_1$, are called non-premature.
For example, Figure 7 shows a simple dag with a parallel prefix $\sigma_p$
for an arbitrary p-schedule $s_p$, its corresponding serial prefix
$\sigma_1$, and a possible classification of nodes as heavy or light.
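The definitions of the corresponding serial prefix and of premature nodes can be made concrete with a short sketch. It assumes, purely for illustration, that each node is identified by its 1-based position in the 1DF-schedule (as in the numbering of Figure 4) and that the parallel prefix is given as a set of such positions; the function name `split_prefix` is hypothetical.

```python
from typing import Set, Tuple


def split_prefix(executed: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Split a parallel prefix into (serial prefix, premature nodes).

    `executed` holds the 1DF positions (1-based) of all nodes executed so
    far. The serial prefix is the longest run 1, 2, ..., k that lies
    entirely inside `executed`; every other executed node is premature.
    """
    k = 0
    while (k + 1) in executed:
        k += 1
    serial_prefix = set(range(1, k + 1))
    premature = executed - serial_prefix
    return serial_prefix, premature
```

With executed nodes {1, 2, 3, 5, 8}, the serial prefix is {1, 2, 3} and nodes 5 and 8 are premature, because node 4 has not yet run.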

A ready thread being present in a deque is equivalent to its first
unexecuted node (action) being in the deque, and we will use the two phrases
interchangeably. Given a p-schedule $s_p$ of a dag G generated by algorithm
DFDeques, we can find a unique last parent for every node in G (except for
the root node) as follows. The last parent of a node u in G is defined as
the last of u's parent nodes to be executed in the schedule $s_p$. If two or
more parent nodes of u were the last to be executed, the processor executing
one of them continues execution of u’s thread. We label the unique parent of
u executed by this processor as its last parent. This processor may have to
preempt u's thread without executing u if it runs out of its memory quota;
in this case, it puts u’s thread on to its deque and then gives up the
deque.

Consider the prefix $\sigma_p$ of the parallel schedule $s_p$ after the
first j timesteps, for any $1 \le j \le \tau$. Let v be the last
non-premature node (i.e., the last node from $\sigma_1$) to be executed
during the first j timesteps of $s_p$. If more than one such node exists,
let v be any one of them. Let P be a set of nodes in the dag constructed as
follows: P is initialized to {v}; for every node u in P, the last parent of
u is added to P. Since the root is the only node at depth 1, it must be in
P, and thus, P contains exactly all the nodes along a particular path from
the root to v. Further, since v is non-premature, all the nodes in P are
non-premature.

Let $u_i$ be the node in P at depth i; then $u_1$ is the root, and
$u_\delta$ is the node v, where δ is the depth of v. Let $t_i$ be the
timestep in which $u_i$ is executed; then $t_1 = 1$ since the root is
executed in the first timestep. For $i = 2, \ldots, \delta$ let $I_i$ be the
interval $\{t_{i-1} + 1, \ldots, t_i\}$, and let $I_1 = \{1\}$. Let
$I_{\delta+1} = \{t_\delta + 1, \ldots, j\}$. Since $\sigma_p$ consists of
all the nodes executed in the first j timesteps, the intervals
$I_1, \ldots, I_{\delta+1}$ cover the duration of execution of all nodes in
$\sigma_p$.

We first prove the following lemma regarding the nodes in a deque below any
of the nodes $u_i$ in P.

Lemma 4.1 For any $1 \le i \le \delta$, let $u_i$ be the node in P at
depth i. Then,

1. If during the execution $u_i$ is on some deque, then every node below it
in its deque is the right child of some node in P.

2. When $u_i$ is executed on a processor, every node on the processor's
deque must be the right child of some node in P.

Proof. We can prove this lemma to be true for any $u_i$ by induction on i.
The base case is the root node. Initially it is the only node in its deque,
and gets executed before any new nodes are created. Thus, the lemma is
trivially true. Let us assume the lemma is true for all $u_j$, for
$0 < j \le i$. We must prove that it is true for $u_{i+1}$.

Figure 7: (a) An example snapshot of a parallel schedule for a simple dag.
The shaded nodes (the set of nodes in $\sigma_p$) have been executed, while
the blank (white) nodes have not. Of the nodes in $\sigma_p$, the black
nodes form the corresponding serial prefix $\sigma_1$, while the remaining
grey nodes are premature. (b) A possible partitioning of nodes in
$\sigma_p$ into heavy and light nodes. Each shaded region denotes the set of
nodes executed consecutively in depth-first order on a single processor
($P_1$, $P_2$, $P_3$ or $P_4$) between steals. The heavy node in each region
is shown shaded black.

Since $u_i$ is the last parent of $u_{i+1}$, $u_{i+1}$ becomes ready
immediately after $u_i$ is executed on some processor. There are two
possibilities:

1. $u_{i+1}$ is executed immediately following $u_i$ on that processor.
Property (1) holds trivially since $u_{i+1}$ is never put on a deque. If the
deque remains unchanged before $u_{i+1}$ is executed, property (2) holds
trivially for $u_{i+1}$. Otherwise, the only change that may be made to the
deque is the addition of the right child of $u_i$ before $u_{i+1}$ is
executed, if $u_i$ was a fork with $u_{i+1}$ as its left child. In this case
too, property (2) holds, since the new node in the deque is the right child
of some node in P.

2. $u_{i+1}$ is added to the processor’s deque after $u_i$ is executed. This
may happen because $u_i$ was a fork and $u_{i+1}$ was its right child (see
Figure 8), or because the processor exhausted its memory quota. In the
former case, since $u_{i+1}$ is the right child of $u_i$, nothing can be
added to the deque before $u_{i+1}$. In the latter case (that is, the memory
quota is exhausted before $u_{i+1}$ is executed), the only node that may be
added to the deque before $u_{i+1}$ is the right child of $u_i$, if $u_i$ is
a fork. This does not violate the lemma. Once $u_{i+1}$ is added to the
deque, it may either get executed on a processor when it becomes the topmost
node in the deque, or it may get stolen. If it gets executed without being
stolen, properties (1) and (2) hold, since no new nodes can be added below
$u_{i+1}$ in the deque. If it is stolen, the processor that steals and
executes it has an empty deque, and therefore properties (1) and (2) are
true, and continue to hold until $u_{i+1}$ has been executed.


Recall that heavy nodes are a property of the parallel schedule, while
premature nodes are defined relative to a given prefix of the parallel
schedule. To prove the space bound, we first bound the number of heavy
premature nodes in an arbitrary prefix $\sigma_p$ of $s_p$.

Lemma 4.2 Let $\sigma_p$ be any parallel prefix of a p-schedule produced by
algorithm DFDeques(K) for a computation with depth D, in which every action
allocates at most K space. Then the expected number of heavy premature nodes
in $\sigma_p$ is $O(p \cdot D)$. Further, for any $\epsilon > 0$, the number
of heavy premature nodes is $O(p \cdot (D + \ln(1/\epsilon)))$ with
probability at least $1 - \epsilon$.

Proof. Consider the start of any interval $I_i$ of $\sigma_p$, for
$i = 1, \ldots, \delta$ (we will look at the last interval $I_{\delta+1}$
separately). By Lemma 3.1, all nodes in the deques to the left of $u_i$'s
deque, and all nodes above $u_i$ in its deque are non-premature. Let $x_i$
be the number of nodes below $u_i$ in its deque. Because steals target the
first p deques in R, heavy premature nodes can be picked in any timestep
from at most p deques. Further, every time a heavy premature node is picked,
the deque containing $u_i$ must also be a candidate deque to be picked as a
target for a steal; that is, $u_i$ must be among the leftmost p deques.
Consider only the timesteps in which $u_i$ is among the leftmost p deques;
we will refer to such timesteps as candidate timesteps. Because new deques
may be created to the left of $u_i$ at any time, the candidate timesteps
need not be contiguous.

We now bound the total number of steal attempts that take place during the
candidate timesteps. Each such steal attempt may result in the execution of
a heavy premature node; steals in all other timesteps result in the
execution of heavy, but non-premature nodes. Each timestep can have at most
p steal attempts. Therefore, we can partition the candidate timesteps into
phases, such that each phase has between p and 2p − 1 steal attempts. We
call a phase in interval $I_i$ successful if at least one of its O(p) steal
attempts targets the deque containing $u_i$. Let $X_{ij}$ be the random
variable with value 1 if the jth phase in interval $I_i$ is successful, and
0 otherwise. Because targets for steal attempts are chosen at random from
the leftmost p deques with uniform probability, and because each phase has
at least p steal attempts,

$$\Pr[X_{ij} = 1] \;\ge\; 1 - \left(1 - \frac{1}{p}\right)^{p} \;\ge\; 1 - \frac{1}{e} \;>\; \frac{1}{2}$$

Thus, each phase succeeds with probability greater than 1/2. Because $u_i$
must get executed before or by the time $x_i + 1$ successful steals target
$u_i$’s deque, there can be at most $x_i + 1$ successful phases in interval
$I_i$. The node $u_i$ may get executed before $x_i + 1$ steal attempts
target its deque, if its owner processor executes $u_i$ off the top of the
deque. Let there be some $n_i \le (x_i + 1)$ successful phases in the
interval $I_i$. From Lemma 4.1, the $x_i$ nodes below $u_i$ are right
children of nodes in P. There are $(\delta - 1) < D$ nodes along P not
including $u_\delta$, and each of them may have at most one right child.
Further, each successful phase in any of the first δ intervals results in at
least one of these right children (or the current ready node on P) being
executed. Therefore, the total number of successful phases in the first δ
intervals is $\sum_{i=1}^{\delta} n_i < 2D$.

Figure 8: (a) A portion of the dynamically unfolding dag during the
execution. Node $u_{i+1}$ along the path P is ready, and is currently
present in some deque. The deque is shown in (b); all nodes below $u_{i+1}$
on the deque must be right children of some nodes on P above $u_{i+1}$. In
this example, node $u_{i+1}$ was the right child of $u_i$, and was added to
the deque when the fork at $u_i$ was executed. Subsequently, descendants of
the left child of $u_i$ (e.g., node d), may be added to the deque above
$u_{i+1}$.

Finally, consider the final interval $I_{\delta+1}$. Let z be the ready node
at the start of the interval with the highest priority. Then,
$z \notin \sigma_p$, because otherwise z (or some other node), and not v,
would have been the last non-premature node to be executed in $\sigma_p$.
Hence, if z is about to be executed on a processor, then interval
$I_{\delta+1}$ is empty. Otherwise, z must be at the top of the leftmost
deque at the start of interval $I_{\delta+1}$. Using an argument similar to
that of Lemma 4.1, we can show that the nodes below z in the deque must be
right children of nodes along a path from the root to z. Thus, z can have at
most (D − 2) nodes below it. Because z must be among the leftmost p deques
throughout the interval $I_{\delta+1}$, the phases in this interval are
formed from all its timesteps. We call a phase successful in interval
$I_{\delta+1}$ if at least one of the O(p) steal attempts in the phase
targets the deque containing z. Then this interval must have fewer than D
successful phases. As before, the probability of a phase being successful is
at least 1/2.

We have shown that the first $j \le \tau$ timesteps of the parallel
execution (i.e., the time within which nodes from $\sigma_p$ are executed)
must have fewer than 3D successful phases. Each phase may result in O(p)
heavy premature nodes being stolen and executed. Further, for
$i = 1, \ldots, \delta$, in each interval $I_i$ another p − 1 heavy
premature nodes may be executed in the same timestep that $u_i$ is executed.
Therefore, if $\sigma_p$ has a total of N phases, the number of heavy
premature nodes in $\sigma_p$ is at most $(N + D) \cdot p$. Because the
entire execution must have fewer than 3D successful phases, and each phase
succeeds with probability > 1/2, the expected number of total phases before
we see 3D successful phases is at most 6D. Therefore, the expected number of
heavy premature nodes in $\sigma_p$ is at most
$(6D + D) \cdot p = O(p \cdot D)$.
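The two quantities used in this part of the argument can be checked numerically. The sketch below is an illustration only: it assumes a phase with exactly p steal attempts, each choosing uniformly among p deques, and evaluates the chance that at least one attempt hits a given deque, together with the $(6D + D) \cdot p$ count from the text. Both function names are hypothetical.

```python
import math


def phase_success_lower_bound(p: int) -> float:
    """Chance that at least one of p uniform steal attempts (over p deques)
    targets one particular deque: 1 - (1 - 1/p)^p."""
    return 1.0 - (1.0 - 1.0 / p) ** p


def expected_heavy_premature_bound(p: int, depth: int) -> int:
    """Evaluate (6D + D) * p, the expected-case count from the proof.

    At most 3D successful phases, each succeeding with probability > 1/2,
    gives at most 6D phases in expectation; D more timesteps account for
    the nodes executed alongside each u_i.
    """
    expected_phases = 2 * (3 * depth)
    return (expected_phases + depth) * p


# The success probability stays above 1 - 1/e (> 1/2) for every p tried.
for p in (1, 2, 8, 64, 1024):
    assert phase_success_lower_bound(p) >= 1.0 - 1.0 / math.e > 0.5
```

A phase with more than p attempts can only be more likely to succeed, so the value computed here is a lower bound for every phase.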

The high probability bound can be proved as follows. Suppose the execution
takes at least $12D + 8\ln(1/\epsilon)$ phases. Then the expected number of
successful phases is at least $\mu = 6D + 4\ln(1/\epsilon)$. Using the
Chernoff bound [32, Theorem 4.2] on the number of successful phases X, and
setting $a = 6D + 8\ln(1/\epsilon)$, we get⁷

$$\Pr[X < \mu - a/2] \;<\; \exp\left[-\frac{(a/2)^2}{2\mu}\right]$$

Therefore,

$$\begin{aligned}
\Pr[X < 3D] &< \exp\left[-\frac{a^2}{4 \cdot (12D + 8\ln(1/\epsilon))}\right] \\
&= \exp\left[-\frac{a^2}{4 \cdot (2a - 8\ln(1/\epsilon))}\right] \\
&< e^{-a^2/8a} \\
&= e^{-a/8} \\
&= e^{-(6D + 8\ln(1/\epsilon))/8} \\
&< e^{-8\ln(1/\epsilon)/8} \\
&= \epsilon
\end{aligned}$$

Because there can be at most 3D successful phases, algorithm DFDeques
requires $12D + 8\ln(1/\epsilon)$ or more phases with probability at most
$\epsilon$. Recall that each phase consists of O(p) steal attempts.
Therefore, $\sigma_p$ has $O(p \cdot (D + \ln(1/\epsilon)))$ heavy premature
nodes with probability at least $1 - \epsilon$. ■

7 The probability of success for a phase is not necessarily independent of
previous phases; however, because each phase succeeds with probability at
least 1/2, independent of other phases, we can apply the Chernoff bound.

We can now state a lemma relating the number of heavy premature nodes in
$\sigma_p$ with the memory requirement of $s_p$.

Lemma 4.3 Let G be a dag with depth D, in which every node allocates at most
K space, and for which the serial depth-first execution requires $S_1$
space. Let $s_p$ be the p-schedule of length τ generated for G by algorithm
DFDeques(K). If for any i such that $1 \le i \le \tau$, the prefix
$\sigma_p$ of $s_p$ representing the computation after the first i timesteps
contains at most r heavy premature nodes, then the parallel space
requirement of $s_p$ is at most $S_1 + r \cdot \min(K, S_1)$. Further, there
are at most $D + r \cdot \min(K, S_1)$ active threads during the execution.

Figure 9: An example scenario when a processor may not execute a contiguous
subsequence of nodes between steals. The shaded regions indicate the subset
of nodes executed on each of the two processors, $P_a$ and $P_b$. Here,
processor $P_a$ steals the thread t and executes node u. It then forks a
child thread (containing node v), puts thread t on its deque, and starts
executing the child. In the meantime, processor $P_b$ steals thread t from
the deque belonging to $P_a$, and executes it until it suspends.
Subsequently, $P_a$ finishes executing the child thread, wakes up the
suspended parent t, and resumes execution of t. The combined sets of nodes
executed on both processors form a contiguous subsequence of the
1DF-schedule.

Proof. We can partition $\sigma_p$ into the set of non-premature nodes and
the set of premature nodes. Since, by definition, all non-premature nodes
form some serial prefix of the 1DF-schedule, their net memory allocation
cannot exceed $S_1$.

We now bound the net memory allocated by the premature nodes. Consider a
steal that results in the execution of a heavy premature node on a processor
$P_a$. The nodes executed by $P_a$ until its next steal cannot allocate more
than K space. Because there are at most r heavy premature nodes executed,
the total space allocated across all processors after i timesteps cannot
exceed $S_1 + r \cdot K$.

We can now obtain a tighter bound when $K > S_1$. Consider the case when
processor $P_a$ steals a thread and executes a heavy premature node. The
nodes executed by $P_a$ before the next steal are all premature, and form a
series of one or more subsequences of the 1DF-schedule. The intermediate
nodes between these subsequences (in depth-first order) are executed on
other processors (e.g., see Figure 9). These intermediate nodes occur when
other processors steal threads from the deque belonging to $P_a$, and finish
executing the stolen threads before $P_a$ finishes executing all the
remaining threads in its deque. Subsequently, when $P_a$’s deque becomes
empty, the thread executing on $P_a$ may wake up its parent, so that $P_a$
starts executing the parent without performing another steal. Therefore, the
set of nodes executed by $P_a$ before the next steal, possibly along with
premature nodes executed on other processors, form a contiguous subsequence
of the 1DF-schedule.

Assuming that the net space allocated during the 1DF-schedule can never be
negative, this subsequence cannot allocate more than $S_1$ units of net
memory. Therefore, the net memory allocation of all the premature nodes
cannot exceed $r \cdot \min(K, S_1)$, and the total space allocated across
all processors after i timesteps cannot exceed
$S_1 + r \cdot \min(K, S_1)$. Because this bound holds for every prefix of
$s_p$, it holds for the entire parallel execution.

The maximum number of active threads is at most the number of threads with
premature nodes, plus the maximum number of active threads during a serial
execution, which is D. Assuming that each thread needs to allocate at least
a unit of space when it is forked (e.g., to store its register state), at
most $\min(K, S_1)$ threads with premature nodes can be forked for each
heavy premature node executed. Therefore, the total number of active threads
is at most $D + r \cdot \min(K, S_1)$. ■

Note that each active thread requires at most a constant amount of space to
be stored by the scheduler (not including stack space). We now extend the
analysis to handle large allocations.

Handling large allocations of space. We had assumed earlier in this section
that every node allocates at most K units of memory. Individual nodes that
allocate more than K space are handled as described in Section 3. The key
idea is to delay the big allocations, so that if threads with higher
priorities become ready, they will be executed instead. The solution is to
insert before every allocation of m bytes (m > K), a binary fork tree of
depth log(m/K), so that m/K dummy threads are created at its leaves. Each of
the dummy threads simply performs a no-op that takes one timestep, but the
threads at the leaves of the fork tree are treated as if they were
allocating K space; a processor gives up its deque and performs a steal
after executing each of these dummy threads. Therefore, by the time the m/K
dummy threads are executed, a processor may proceed with the allocation of m
bytes without exceeding our space bound. Recall that in our cost model, an
allocation of m bytes requires a depth of O(log m); therefore, this
transformation of the dag increases its depth by at most a constant factor.
This transformation takes place at runtime, and the online DFDeques
algorithm generates a schedule for this transformed dag. Therefore, the
final bound on the space requirement of the generated schedule, using
Lemmas 4.2 and 4.3, is stated below.

Theorem 4.4 (Upper bound on space requirement)

Consider a nested-parallel computation with depth D and serial, depth-first
space requirement $S_1$. Then, for any K > 0, the expected value of the
space required to execute the computation on p processors using algorithm
DFDeques(K), including the space required to store active threads, is
$S_1 + O(\min(K, S_1) \cdot p \cdot D)$. Further, for any $\epsilon > 0$,
the probability that the computation requires
$S_1 + O(\min(K, S_1) \cdot p \cdot (D + \ln(1/\epsilon)))$ space is at
least $1 - \epsilon$. ■
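The dummy-thread transformation for large allocations, described just before Theorem 4.4, can be illustrated with a short calculation. The sketch below is an assumption-laden illustration rather than the library's implementation: the text gives the tree depth as log(m/K) with m/K leaves, and the rounding up to whole threads and whole tree levels here is an added assumption, as is the function name `dummy_fork_tree`.

```python
from typing import Tuple


def dummy_fork_tree(m: int, k: int) -> Tuple[int, int]:
    """Return (depth, leaves) of the binary fork tree placed before an
    allocation of m bytes under memory threshold k.

    Allocations of at most k bytes need no dummy threads. Otherwise about
    m/k dummy threads sit at the leaves of a tree of depth about
    log2(m/k); both values are rounded up (an assumption of this sketch).
    """
    if m <= k:
        return (0, 0)
    leaves = -(-m // k)                 # integer ceiling of m / k
    depth = (leaves - 1).bit_length()   # integer ceiling of log2(leaves)
    return (depth, leaves)
```

For instance, with K = 1024 bytes an 8192-byte allocation would be preceded by a tree of depth 3 with 8 dummy threads, so the extra depth stays within the O(log m) already charged to the allocation in the cost model.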

We now show that the above space bound is tight (within constant factors)
in the expected case, for algorithm DFDeques.

Theorem 4.5 (Lower bound on space requirement)

For any $S_1 > 0$, p > 0, K > 0, and D > 24 log p, there exists a nested
parallel dag with a serial space requirement of $S_1$ and depth D, such that
the expected space required by algorithm DFDeques(K) to execute it on p
processors is $\Omega(S_1 + \min(K, S_1) \cdot p \cdot D)$.

Proof. Consider the dag shown in Figure 10. The black nodes denote
allocations, while the grey nodes denote deallocations. The dag essentially
has a fork tree of depth log(p/2), at the leaves of which exist subgraphs⁸.
The root nodes of these subgraphs are labelled $u_1, u_2, \ldots, u_n$,
where n = p/2. The leftmost of these subgraphs, $G_0$, shown in Figure 10
(b), consists of a serial chain of d nodes. The remaining subgraphs are
identical, have a depth of 2d + 1, and are shown in Figure 10 (c). The
amount of space allocated by each of the black nodes in these subgraphs is
defined as $A = \min(K, S_1)$. Since we are constructing a dag of depth D,
the value of d is set such that 2d + 1 + 2 log(p/2) = D. The space
requirement of a 1DF-schedule for this dag is $S_1$.

We now examine how algorithm DFDeques(K) would execute such a dag. One
processor starts executing the root node, and executes the left child of the
current node at each timestep. Thus, within log(p/2) = log n timesteps, it
will have executed node $u_1$. Now consider node $u_n$; it is guaranteed to
be executed once log n successful steals target the root thread. (Recall
that the right child of a forking node, that is, the next node in the parent
thread, must be executed either before or when the parent thread is next
stolen.) Because there are always n = p/2 processors in this example that
are idle and attempt steals targeting p deques at the start of every
timestep, the probability $P_{steal}$ that a steal will target a particular
deque is given by

$$P_{steal} \;\ge\; 1 - \left(1 - \frac{1}{p}\right)^{p/2} \;\ge\; 1 - e^{-1/2} \;>\; \frac{1}{3}$$

We call a timestep i successful if some node along the path
from the root to $u_n$ gets executed; this happens when a steal
targets the deque containing that node. Thus, after $\log n$ successful
timesteps, node $u_n$ must get executed; after that, we
can consider every subsequent timestep to be successful. Let
S be the number of successful timesteps in the first $12 \log n$
timesteps. Then, the expected value is given by

$$E[S] \ge 12 \log n \cdot P_{\text{steal}} \ge 4 \log n.$$

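Using the Chernoff bound [32, Theorem 4.2] on the number of
successful timesteps, we have

$$\Pr\left[S < \left(1 - \tfrac{3}{4}\right) \cdot E[S]\right] \le \exp\left[-\left(\tfrac{3}{4}\right)^2 \cdot \frac{E[S]}{2}\right].$$

Therefore,

$$\Pr[S < \log n] \le \exp\left[-\frac{9}{8} \log n\right] = \exp\left[-\frac{9}{8} \cdot \frac{\ln n}{\ln 2}\right] < e^{-1.62 \ln n} = n^{-0.62} \cdot \frac{1}{n} < \frac{2}{3} \cdot \frac{1}{n} \quad \text{for } p \ge 4.$$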


8  All  logarithms  denoted  as  log  are  to  the  base  2. 


Recall that $n = p/2$. (The case of $p < 4$ can be easily handled
separately.) Let $\mathcal{E}_i$ be the event that node $u_i$ is not
executed within the first $12 \log n$ timesteps. We have shown
that $\Pr[\mathcal{E}_n] \le 2/3 \cdot 1/n$. Similarly, we can show that for
each $i = 1, \ldots, n - 1$, $\Pr[\mathcal{E}_i] \le 2/3 \cdot 1/n$. Therefore,
$\Pr[\bigcup_{i=1}^{n} \mathcal{E}_i] \le 2/3$. Thus, for $i = 1, \ldots, n$, all the $u_i$ nodes
get executed within the first $12 \log n$ timesteps with probability
greater than 1/3.
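
The constants used so far can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are ours, and it assumes p is even so that n = p/2 is an integer. It evaluates the probability that at least one of p/2 uniformly random steals hits a given deque, and the Chernoff tail exp(-(9/8) log n), and compares them with the thresholds 1/3 and (2/3)(1/n) used in the proof.

```python
import math


def steal_probability(p):
    """Probability that at least one of n = p/2 uniformly random steals
    targets one particular deque out of p deques."""
    n = p // 2
    return 1.0 - (1.0 - 1.0 / p) ** n


def chernoff_tail(n):
    """exp(-(9/8) * log2(n)), the bound on Pr[S < log n] in the proof."""
    return math.exp(-(9.0 / 8.0) * math.log2(n))


def constants_hold(p):
    """True if both thresholds used in the proof hold for this p."""
    n = p // 2
    steal_ok = steal_probability(p) > 1.0 / 3.0
    tail_ok = chernoff_tail(n) < (2.0 / 3.0) / n
    return steal_ok and tail_ok


if __name__ == "__main__":
    for p in (4, 8, 64, 1024):
        print(p, round(steal_probability(p), 4), constants_hold(p))
```

Both thresholds are tightest at p = 4, which is why the proof treats smaller p separately.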

Each subgraph G has d nodes at different depths that allocate
memory; the first of these nodes cannot be executed
before timestep $\log n$. Let t be the first timestep at which
all the $u_i$ nodes have been executed. Then, at this timestep,
there are at least $(d + \log n - t)$ nodes remaining in each subgraph
G that allocate A bytes each, but have not yet been executed.
Similarly, node w in subgraph $G_0$ will not be executed
before timestep $(d + \log n)$, that is, another $(d + \log n - t)$
timesteps after timestep t. Therefore, for the next
$(d + \log n - t)$ timesteps there are always $n - 1 = (p/2) - 1$
non-empty deques (out of a total of p deques) during the execution.
Each time a thread is stolen from one of these deques,
a black node (see Figure 10 (c)) is executed, and the
thread then suspends. Because p/2 processors become idle
and attempt a steal at the start of each timestep, we can show
that in the expected case, at least a constant fraction of the
p/2 steals are successful in every timestep. Each successful
steal results in $A = \min(S_1, K)$ units of memory being allocated.
Consider the case when $t = 12 \log n$. Then, using
linearity of expectations, over the $d - 11 \log n$ timesteps after
timestep t, the expected value of the total space allocated is
$S_1 + \Omega(A \cdot p \cdot (d - 11 \log n)) = S_1 + \Omega(A \cdot p \cdot (D - \log p))$.
($D \ge 24 \log p$ ensures that $(d - 11 \log n) > 0$.)

We showed that with constant probability (> 1/3), all the
$u_i$ nodes will be executed within the first $12 \log n$ timesteps.
Therefore, in the expected case, the space allocated (at some
point during the execution after all $u_i$ nodes have been executed)
is $\Omega(S_1 + \min(S_1, K) \cdot (D - \log p) \cdot p)$. ■


Corollary  4.6  (Lower  bound  using  work  stealing) 

For any $S_1 > 0$, $p > 0$, and $D \ge 24 \log p$, there exists a
nested parallel dag with a serial space requirement of $S_1$ and
depth $D$, such that the expected space required to execute it
using the space-efficient work stealer from [9] on p processors
is $\Omega(S_1 \cdot p \cdot D)$.


The corollary follows from Theorem 4.5 and the fact that algorithm
DFDeques behaves like the space-efficient work-stealing
scheduler for $K = \infty$. Blumofe and Leiserson [9] presented
an upper bound on space of $p \cdot S_1$ using randomized work
stealing. Their result is not inconsistent with the above corollary,
because their analysis allows only “stack-like” memory
allocation⁹, which is more restricted than our model. For such
restricted dags, their space bound of $p \cdot S_1$ also applies directly
to DFDeques($\infty$). Our lower bound is also consistent with the
upper bound of $p \cdot S$ by Simpson and Burton [41], where S is
the maximum space requirement over all possible depth-first
schedules; in this example, $S = S_1 \cdot D$.

9 Their model does not allow allocation of space on a global heap. An
instruction in a thread may allocate stack space only if the thread cannot possibly
have a living child when the instruction is executed. The stack space allocated
by the thread must be freed when the thread terminates.
Figure 10: (a) The dag for which the existential lower bound holds; (b) and (c) present the details of the subgraphs shown in
(a).  The  black  nodes  denote  allocations  and  grey  nodes  denote  deallocations;  the  nodes  are  marked  with  the  amount  of  memory 
(de)allocated. 
4.3 Time bound

We now prove the time bound required for a parallel computation
using algorithm DFDeques. This time bound does not include
the scheduling costs of maintaining the relative order of
the deques (i.e., inserting and deleting deques in $\mathcal{R}$), or finding
the m-th deque. Elsewhere [33], we describe how the scheduler
can be parallelized, and then prove the time bound including
these scheduling costs. We first assume that every action
allocates at most K space, for some constant K, and prove the
time bound. We then relax this assumption and provide the
modified time bound at the end of this subsection.

Lemma 4.7 Consider a parallel computation with work W
and depth D, in which every action allocates at most K space.
The expected time to execute this computation on p processors
using the DFDeques(K) scheduling algorithm is $O(W/p + D)$.
Further, for any $\epsilon > 0$, the time required to execute the
computation is $O(W/p + D + \ln(1/\epsilon))$ with probability at
least $1 - \epsilon$.

Proof. Consider any timestep i of the p-schedule; let $n_i$ be
the number of deques in $\mathcal{R}$ at timestep i. We first classify each
timestep i into one of two types (A and B), depending on the
value of $n_i$. We then bound the total number of timesteps $T_A$
and $T_B$ of types A and B, respectively.

Type A: $n_i \ge p$. At the start of timestep i, let there be $r \le p$
steal attempts in this timestep. Then the remaining $p - r$
processors are busy executing nodes, that is, at least $p - r$
nodes are executed in timestep i. Further, at most $p - r$ of the
leftmost p deques may be empty; the rest must have at least
one thread in them.

Let $X_j$ be the random variable with value 1 if the jth non-empty
deque in $\mathcal{R}$ (from the left end) gets exactly one steal
request, and 0 otherwise. Then, $E[X_j] = \Pr[X_j = 1] =
(r/p) \cdot (1 - 1/p)^{r-1}$. Let X be the random variable representing
the total number of non-empty deques that get exactly
one steal request. Because there are at least r non-empty deques,
the expected value of X (assuming that $p \ge 2$) is given
by

$$E[X] \ge \sum_{j=1}^{r} E[X_j] = \frac{r^2}{p}\left(1 - \frac{1}{p}\right)^{r-1} \ge \frac{r^2}{p}\left(1 - \frac{1}{p}\right)^{p} \ge \frac{r^2}{p}\left(1 - \frac{1}{p}\right) \cdot \frac{1}{e} \ge \frac{r^2}{2 \cdot p \cdot e}.$$

Recall that $p - r$ nodes are executed by the busy processors.
Therefore, if Y is the random variable denoting the total number
of nodes executed during this timestep, then

$$E[Y] \ge (p - r) + \frac{r^2}{2ep} \ge \frac{p}{2e}.$$

Therefore,

$$E[p - Y] \le p - \frac{p}{2e} = p\left(1 - \frac{1}{2e}\right).$$
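
As a sanity check on the bound $E[X] \ge r^2/(2pe)$, the expectation can be evaluated exactly for small p. The sketch below is ours (the function names are illustrative): it assumes exactly r non-empty deques, the minimum the proof relies on, and steal requests that each pick one of p target deques uniformly and independently.

```python
import math


def expected_single_requests(r, p):
    """Expected number of deques, among r non-empty ones, that receive
    exactly one of r independent, uniformly random steal requests
    spread over p target deques (requires p >= 2)."""
    per_deque = (r / p) * (1.0 - 1.0 / p) ** (r - 1)
    return r * per_deque


def lower_bound(r, p):
    """The bound r^2 / (2 p e) used in the proof."""
    return r * r / (2.0 * p * math.e)


def bound_holds(p):
    """Check the bound for every possible number r <= p of steal attempts."""
    return all(expected_single_requests(r, p) >= lower_bound(r, p)
               for r in range(0, p + 1))
```

The exact value exceeds the bound by a comfortable margin, since $(1 - 1/p)^{r-1} \ge 1/e$ already holds for all $r \le p$; the extra factor of 2 simplifies the later algebra.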


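The quantity $(p - Y)$ must be non-negative; therefore, using
Markov's inequality [32, Theorem 3.2], we get

$$\Pr\left[(p - Y) > p\left(1 - \frac{1}{4e}\right)\right] \le \frac{E[(p - Y)]}{p\left(1 - \frac{1}{4e}\right)} \le \frac{1 - \frac{1}{2e}}{1 - \frac{1}{4e}}.$$

Therefore, $\Pr[Y < p/4e] < 9/10$, that is, $\Pr[Y \ge p/4e] > 1/10$.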

We will call each timestep of type A successful if at least
$p/4e$ nodes get executed during the timestep. Then the probability
of the timestep being successful is at least 1/10. Because
there are W nodes in the entire computation, there can be at
most $4e \cdot W/p$ successful timesteps of type A. Therefore, the
expected value for $T_A$ is at most $40e \cdot W/p$.
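
The arithmetic behind the constants 1/10 and 40e can be reproduced in a few lines. This is an illustrative sketch with names of our choosing, not code from the paper.

```python
import math


def success_probability_lower_bound():
    """1 minus the Markov bound (1 - 1/(2e)) / (1 - 1/(4e)) on the
    probability that fewer than p/(4e) nodes execute in a timestep."""
    ratio = (1.0 - 1.0 / (2.0 * math.e)) / (1.0 - 1.0 / (4.0 * math.e))
    return 1.0 - ratio


def expected_type_a_timesteps(work, p):
    """Upper bound 40e * W/p: at most 4e * W/p successful timesteps are
    possible, and each type A timestep succeeds with probability >= 1/10."""
    max_successful = 4.0 * math.e * work / p
    return max_successful * 10.0
```

The Markov ratio evaluates to roughly 0.8987, so the success probability is only slightly above 1/10.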

The analysis of the high probability bound is similar to
that for Lemma 4.2. Suppose the execution takes more than
$80eW/p + 40\ln(1/\epsilon)$ timesteps of type A. Then the expected
number $\mu$ of successful timesteps of type A is at least
$8eW/p + 4\ln(1/\epsilon)$. If Z is the random variable denoting the total number
of successful timesteps, then using the Chernoff bound [32,
Theorem 4.2], and setting $a = 40eW/p + 40\ln(1/\epsilon)$, we get¹⁰

$$\Pr[Z < \mu - a/10] \le \exp\left[-\frac{(a/10)^2}{2\mu}\right].$$

Therefore,

$$\Pr[Z < 4eW/p] \le e^{-a^2/(200\mu)} = \exp\left[-\frac{a^2}{200(a/5 - 4\ln(1/\epsilon))}\right] \le \exp\left[-\frac{a^2}{200 \cdot a/5}\right] = e^{-a/40} = e^{-eW/p - \ln(1/\epsilon)} \le e^{-\ln(1/\epsilon)} = \epsilon.$$


We have shown that the execution will not complete even after
$80eW/p + 40\ln(1/\epsilon)$ type A timesteps with probability at
most $\epsilon$. Thus, for any $\epsilon > 0$, $T_A = O(W/p + \ln(1/\epsilon))$ with
probability at least $1 - \epsilon$.

10 As with the proof of Lemma 4.2, we can use the Chernoff bound here because
each timestep succeeds with probability at least 1/10, even if the exact
probabilities of successes for timesteps are not independent.

Type B: $n_i < p$. We now consider timesteps in which the
number of deques in $\mathcal{R}$ is less than p. As with the proof of
Lemma 4.2, we split type B timesteps into phases such that
each phase has between p and $2p - 1$ steal attempts. We can
then use a potential function argument similar to the dedicated
machine case by Arora et al. [2]. Composing phases from only
type B timesteps (ignoring type A timesteps) retains the validity
of their analysis. We briefly outline the proof here. Nodes
are assigned exponentially decreasing potentials starting from
the root downwards. Thus, a node at a depth of d is assigned
a potential of $3^{2(D-d)}$, and in the timestep in which it is about
to be executed on a processor, a weight of $3^{2(D-d)-1}$. They
show that in any phase during which between p and $2p - 1$ steal
attempts occur, the total potential of the nodes in all the deques
drops by a constant factor with at least a constant probability.
Since the potential at the start of the execution is $3^{2D-1}$, the
expected value of the total number of phases is $O(D)$. The
difference with our algorithm is that a processor may execute
a node, and then put up to 2 (instead of 1) children of the node
on the deque if it runs out of memory; however, this difference
does not violate the basis of their arguments. Since each
phase has $O(p)$ steal attempts, the expected number of steal
attempts during type B timesteps is $O(pD)$. Further, for any
$\epsilon > 0$, we can show that the total number of steal attempts
during timesteps of type B is $O(p \cdot (D + \ln(1/\epsilon)))$ with probability
at least $1 - \epsilon$.

Recall that in every timestep, each processor either executes
a steal attempt that fails, or executes a node from the dag.
Therefore, if $N_{\text{steal}}$ is the total number of steal attempts
during type B timesteps, then $T_B$ is at most $(W + N_{\text{steal}})/p$.
Therefore, the expected value for $T_B$ is $O(W/p + D)$, and for
any $\epsilon > 0$, the number of timesteps is $O(W/p + D + \ln(1/\epsilon))$
with probability at least $1 - \epsilon$.

The total number of timesteps in the entire execution is
$T_A + T_B$. Therefore, the expected number of timesteps in
the execution is $O(W/p + D)$. Further, combining the high
probability bounds for timesteps of type A and B (and using
the fact that $P(A \cup B) \le P(A) + P(B)$), we can show that
for any $\epsilon > 0$, the total number of timesteps in the parallel
execution is $O(W/p + D + \ln(1/\epsilon))$ with probability at least
$1 - \epsilon$. ■

To handle each large allocation of m units (where $m > K$),
recall that we add $\lceil m/K \rceil$ dummy threads; the dummy
threads are forked in a binary tree of depth $O(\log(m/K))$.
Because we assume a depth of $O(\log m)$ for every allocation
of m bytes, this transformation of the dag increases its depth
by at most a constant factor. If $S_a$ is the total space allocated
in the program (not counting the deallocations), the number of
nodes in the transformed dag is at most $W + S_a/K$. Therefore,
using Lemma 4.7, the modified time bound is stated as follows.

Theorem  4.8  (Upper  bound  on  time  requirement) 

The expected time to execute a parallel computation with W
work, D depth, and total space allocation $S_a$ on p processors
using algorithm DFDeques(K) is $O(W/p + S_a/pK + D)$.
Further, for any $\epsilon > 0$, the time required to execute the computation
is $O(W/p + S_a/pK + D + \ln(1/\epsilon))$ with probability
at least $1 - \epsilon$.
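
The dag transformation for large allocations and the resulting terms of the bound can be sketched as follows. This is a hedged illustration rather than the paper's implementation: the function names are ours, the rounding of m/K is an assumption, and the fork tree is assumed to be balanced.

```python
import math


def dummy_thread_tree(m, K):
    """Return (number of dummy threads, depth of the binary fork tree)
    inserted before an allocation of m units, or (0, 0) if m <= K."""
    if m <= K:
        return 0, 0
    count = math.ceil(m / K)              # assumed rounding of m/K
    depth = math.ceil(math.log2(count))   # balanced binary tree of forks
    return count, depth


def time_bound_terms(W, S_a, D, p, K):
    """The three terms of O(W/p + S_a/(pK) + D), hidden constants omitted."""
    return W / p, S_a / (p * K), D
```

For example, an allocation of 1000 units with K = 50 yields 20 dummy threads forked in a tree of depth 5, which is O(log(m/K)).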

In a system where every memory location allocated must be
zeroed, $S_a = O(W)$. The expected time bound therefore becomes
$O(W/p + D)$. This time bound, although asymptotically
optimal [10], is not as low as the time bound of $W/p + O(D)$
for work stealing [9].

Trade-off between space, time, and scheduling granularity.
As the memory threshold K is increased, the scheduling
granularity increases, since a processor can execute more
instructions between steals. In addition, the number of dummy
threads added before large allocations decreases. However, the
space requirement increases with K. Thus, adjusting the value
of K provides a trade-off between running time (or scheduling
granularity), and space requirement.
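
The direction of this trade-off can be read directly off the two asymptotic bounds, $S_1 + O(K \cdot p \cdot D)$ for space and $O(W/p + S_a/pK + D)$ for time. The sketch below sets every hidden constant to 1, which is purely an assumption for illustration, so the numbers it produces show trends only and are not predictions of measured behaviour.

```python
def bounds_for_threshold(K, S1, W, S_a, D, p):
    """Space and time bounds for memory threshold K, with all hidden
    constants set to 1 (an illustrative assumption)."""
    space = S1 + K * p * D
    time = W / p + S_a / (p * K) + D
    return space, time


def sweep(thresholds, **params):
    """Evaluate both bounds for each K; returns (K, space, time) tuples."""
    return [(K,) + bounds_for_threshold(K, **params) for K in thresholds]


if __name__ == "__main__":
    for row in sweep([1000, 10000, 100000],
                     S1=10**6, W=10**9, S_a=10**9, D=100, p=8):
        print(row)
```

Increasing K raises the space term linearly while shrinking only the $S_a/pK$ term of the time bound, so the time benefit flattens once that term is dominated by $W/p + D$.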


5 Experiments with Pthreads

We implemented the scheduler as part of an existing library
for Posix standard threads or Pthreads [23]. The library is
the native, user-level Pthreads library on Solaris 2.5 [38, 43].
Pthreads on Solaris are multiplexed at the user level on top of
kernel threads, which act like virtual processors. The original
scheduler in the Pthread library uses a FIFO queue. Our experiments
were conducted on an 8-processor Enterprise 5000
SMP with 2 GB main memory. Each processor is a 167 MHz
UltraSPARC with a 512 kB L2 cache.

Having to support the general Pthreads functionality prevents
even a user-level Pthreads implementation from being
extremely lightweight. For example, a thread creation is two
orders of magnitude more expensive than a null function call
on the UltraSPARC. Therefore, the user is required to create
Pthreads that are coarse enough to amortize the cost of thread
operations. However, with a depth-first scheduler, threads at
this granularity had to be coarsened further to get good parallel
performance [35]. We show that using algorithm DFDeques,
good speedups can be achieved using Pthreads without this
additional coarsening. Thus, the user can now fix the thread
granularity to amortize thread operation costs, and expect to
get good parallel performance in both space and time.

The Pthreads model supports a binary fork and join mechanism.
We modified memory allocation routines malloc and
free to keep track of the memory quota of the current processor
(or kernel thread) and to fork dummy threads before
an allocation if required. Our scheduler implementation is
a simple extension of algorithm DFDeques that supports the
full Pthreads functionality (including blocking¹¹ mutexes and
condition variables) by maintaining additional entries in $\mathcal{R}$ for
threads suspended on synchronizations. Our benchmarks are
predominantly nested parallel, and make limited use of mutexes
and condition variables. For example, the tree-building
phase in Barnes-Hut uses mutexes to protect modifications to
the tree's cells. However, the Solaris Pthreads implementation
itself makes extensive use of blocking synchronization primitives
such as Pthread mutexes and condition variables.
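
The quota bookkeeping added to malloc and free can be pictured with the following sketch. It is a simplified model under several assumptions that the text does not spell out: the class and method names are hypothetical, the quota is assumed to be reset to K whenever the processor steals, freed bytes are assumed to be credited back to the quota, and the number of dummy threads is assumed to be m/K rounded up.

```python
import math


class QuotaTracker:
    """Per-processor memory quota of K bytes (illustrative model only)."""

    def __init__(self, K):
        self.K = K
        self.remaining = K

    def reset(self):
        """Assumed to be called when the processor performs a steal."""
        self.remaining = self.K

    def on_malloc(self, nbytes):
        """Decide what the scheduler should do for this allocation."""
        if nbytes > self.K:
            # Large allocation: fork dummy threads before allocating.
            return ("fork_dummies", math.ceil(nbytes / self.K))
        if nbytes > self.remaining:
            # Quota exhausted: the running thread must give up the processor.
            return ("preempt", 0)
        self.remaining -= nbytes
        return ("proceed", 0)

    def on_free(self, nbytes):
        """Assumption: deallocations are credited back to the quota."""
        self.remaining += nbytes
```

A call sequence such as `on_malloc(60)`, `on_malloc(60)` with K = 100 returns `("proceed", 0)` and then `("preempt", 0)`, which is the point at which a real scheduler would suspend the thread.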

Since our execution platform is an SMP with a modest
number of processors, access to the ready threads in $\mathcal{R}$ was
serialized. $\mathcal{R}$ is implemented as a linked list of deques protected
by a shared scheduler lock. We optimized the common
cases of pushing and popping threads onto a processor's current
deque by minimizing locking time. A steal requires the
lock to be acquired more often and for a longer period of time.
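
A serialized structure of this kind can be sketched as below. This is not the Solaris implementation: the class and method names are ours, a Python list stands in for the linked list, and the steal policy (choose one of the leftmost p deques at random, take its bottom thread, and place a fresh deque immediately to its right) is our reading of the algorithm and should be treated as an assumption.

```python
import random
import threading
from collections import deque


class DequeList:
    """Ordered list of deques guarded by one shared scheduler lock."""

    def __init__(self, p, seed=None):
        self.p = p
        self.lock = threading.Lock()
        self.deques = []                  # index 0 is the leftmost deque
        self.rng = random.Random(seed)

    def new_deque(self):
        """Append an empty deque at the right end and return it."""
        d = deque()
        with self.lock:
            self.deques.append(d)
        return d

    def push(self, d, thread):
        with self.lock:
            d.append(thread)              # top of the owner's deque

    def pop(self, d):
        with self.lock:
            return d.pop() if d else None

    def steal(self):
        """Try to steal the bottom thread of a random deque among the
        leftmost p; returns (thread, new_deque) or (None, None)."""
        with self.lock:
            k = self.rng.randrange(self.p)
            if k >= len(self.deques) or not self.deques[k]:
                return None, None         # failed steal attempt
            thread = self.deques[k].popleft()
            fresh = deque()
            self.deques.insert(k + 1, fresh)
            return thread, fresh
```

Holding a single lock keeps the sketch simple; as the text notes, a steal does more work under the lock than a local push or pop.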

In the existing Pthreads implementation, it is not always
possible to place a reawakened thread on the same deque as the
thread that wakes it up; therefore, our implementation of DFDeques
is an approximation of the pseudocode in Figure 5. Further,
since we serialize access to $\mathcal{R}$, and support mutexes and
condition variables, setting the memory threshold K to infinity
does not produce the same schedule as the space-efficient
work-stealing scheduler intended for fully strict
computations [9]. Therefore, we can use this setting only as
a rough approximation of a pure work-stealing scheduler.

11 We use the term “blocking” for synchronization that causes the calling
thread to block and suspend, rather than spin wait.

We first list the benchmarks used in our experiments. Next,
we compare the space and time performance of the library's
original scheduler (labelled “FIFO”) with an asynchronous,
depth-first scheduler [35] (labelled “ADF”), and the new DFDeques
scheduler (labelled “DFD”) for a fixed value of the memory
threshold K. We also use DFDeques($\infty$) as an approximation
for a work-stealing scheduler (labelled “DFD-inf”).
To study how the performance of the schedulers is affected by
thread granularity, we present results of the experiments at two
different thread granularities. Finally, we measure the trade-off
between running time, scheduling granularity, and space
for algorithm DFDeques by varying the value of the memory
threshold K for one of the benchmarks.

5.1  Parallel  benchmarks 

The benchmarks were either adapted from publicly available
coarse-grained versions [19, 36, 42, 46], or written from scratch
using the lightweight threads model [35]. The parallelism in
both divide-and-conquer recursion and parallel loops was expressed
as a binary tree of forks, with a separate Pthread created
for each recursive call. Thread granularity was adjusted
by serializing the recursion near the leaves. In the comparison
results in Section 5.2, medium granularity refers to the thread
granularity that provides good parallel performance using the
depth-first scheduler [35]. Even at medium granularity, the
number of threads significantly exceeds the number of processors;
this allows simple coding and automatic load balancing,
while resulting in performance equivalent to hand-partitioned,
coarse-grained code using the depth-first scheduler [35]. Fine
granularity refers to the finest thread granularity that allows the
cost of thread operations in a single-processor execution to be
up to 5% of the serial execution time¹². The benchmarks are
volume rendering, dense matrix multiply, sparse matrix multiply,
Fast Fourier Transform, Fast Multipole Method, Barnes-Hut,
and a decision tree builder¹³. Figure 11 lists the total
number of threads expressed in each benchmark at both the
thread granularities.
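
The coding style described here, a parallel loop written as a binary tree of forks with the recursion serialized near the leaves, can be illustrated as follows. The benchmarks themselves are C programs using Pthreads; this sketch uses Python's `threading` module instead, and the function name and `cutoff` parameter are ours.

```python
import threading


def parallel_for(lo, hi, body, cutoff, forks=None):
    """Apply body(i) for every i in [lo, hi) using a binary tree of forks.

    Ranges of at most `cutoff` iterations run serially, which is how the
    thread granularity is adjusted. `forks`, if given, is a list used
    only to count how many threads were created.
    """
    if hi - lo <= cutoff:
        for i in range(lo, hi):
            body(i)
        return
    mid = (lo + hi) // 2
    # Fork a child thread for the left half; the parent continues with
    # the right half and then joins the child.
    child = threading.Thread(target=parallel_for,
                             args=(lo, mid, body, cutoff, forks))
    if forks is not None:
        forks.append(1)
    child.start()
    parallel_for(mid, hi, body, cutoff, forks)
    child.join()
```

With 64 iterations and a cutoff of 8, the recursion forks at ranges of size 64, 32 and 16, creating 7 threads in total; halving the cutoff doubles the number of leaves and so roughly doubles the thread count.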

5.2  Comparison  results 

In all the comparison results, we use a memory threshold of
K = 50,000 bytes for “ADF” and “DFD”¹⁴. Each active
thread is allocated a minimum 8 kB (1 page) stack. Therefore,
the space-efficient schedulers effectively conserve stack memory
by creating fewer simultaneously active threads compared
to the original FIFO scheduler (see Figure 11). The FIFO
scheduler spends significant portions of time executing system
calls related to memory allocation for the thread stacks [35];
this problem is aggravated when the threads are made fine
grained.

The 8-processor speedups for all the benchmarks at medium
and fine thread granularities are shown in Figure 12. To concentrate
on the effect of the scheduler, and to ignore the effect
of increased thread overheads (up to 5% for all except
dense matrix multiply) at the fine granularity, speedups for
each thread granularity are with respect to the single-processor
multithreaded execution at that granularity. The speedups show
that both the depth-first scheduler and the new DFDeques scheduler
outperform the library's original FIFO scheduler. However,
at the fine thread granularity, the new scheduler provides
better performance than the depth-first scheduler. This difference
can be explained by the better locality and lower scheduling
contention experienced by algorithm DFDeques.

12 The exception was the dense matrix multiply, which we wrote for n × n
blocks, where n is a power of two. Therefore, fine granularity involved reducing
the block size by a factor of 4, and increasing the number of threads by a factor
of 8, resulting in 10% additional overhead.

13 Details on the benchmarks can be found elsewhere [33].

14 In the depth-first scheduler, the memory threshold K is the memory quota
assigned to each thread between thread preemptions [35].
Figure 12: Speedups on 8 processors with respect to single-processor
executions for the three schedulers (the original
“FIFO”, the depth-first “ADF”, and the new “DFD” or
DFDeques) at both medium and fine thread granularities,
with K = 50,000 bytes. Performance of “DFD-inf” (or
DFDeques($\infty$)), being very similar to that of “DFD”,
is not shown here. All benchmarks were compiled using
cc -fast -xarch=v8plusa -xchip=ultra
-xtarget=native -xO4.
Figure 11: Input sizes for each benchmark, total number of threads expressed in the program at medium and fine granularities, and
max.  number  of  simultaneously  active  threads  created  by  each  scheduler  at  both  granularities,  for  K  =  50,000  bytes.  “DFD-inf” 
creates  at  most  twice  as  many  threads  as  “DFD”  for  Dense  MM,  and  at  most  15%  more  threads  than  “DFD”  for  the  remaining 
benchmarks. 


Figure  13:  Variation  of  the  memory  requirement  with  the  num¬ 
ber  of  processors  for  dense  matrix  multiply  using  three  sched¬ 
ulers:  depth-first  (“ADF”),  DFDeques  (“DFD”),  and  Cilk 
(“Cilk”). 

We  measured  the  external  (L2)  cache  miss  rates  for  each 
benchmark  using  on-chip  UltraSPARC  performance  counters. 
Figure  1,  which  lists  the  results  at  the  fine  thread  granularity, 
shows  that  our  scheduler  achieves  relatively  low  cache  miss 
rates  (i.e.,  results  in  better  locality). 

Three  out  of  the  seven  benchmarks  make  significant  use 
of  heap  memory.  For  these  benchmarks,  we  measured  the 
high  water  mark  for  heap  memory  allocation  using  the  three 
schedulers.  Figure  14  shows  that  algorithm  DFDeques  results 
in  slightly  higher  heap  memory  requirement  compared  to  the 
depth-first  scheduler,  but  still  outperforms  the  original  FIFO 
scheduler. 

The Cilk runtime system [20] uses a provably space-efficient
work stealing algorithm to schedule threads¹⁵. Figure 13 compares
the space performance of Cilk with the depth-first and
DFDeques schedulers for the dense matrix multiply benchmark
(at the fine thread granularity). The figure indicates that
DFDeques requires more memory than the depth-first scheduler,
but less memory than Cilk. In particular, similar to the
depth-first scheduler, the memory requirement of DFDeques
increases slowly with the number of processors.

15 Because Cilk requires gcc to compile the benchmarks (which results in
slower code for floating point operations compared to the native cc compiler
on UltraSPARCs), we do not show a direct comparison of running times or
speedups of Cilk benchmarks with our Pthreads-based system here.

5.3 Measuring the tradeoff between space, time, and
scheduling granularity

We studied the effect of the size of memory threshold K on
the running time, memory requirement, and scheduling granularity
using DFDeques(K). Each processor keeps track of the
number of times a thread from its own deque is scheduled, and
the number of times it has to perform a steal. The ratio of these
two counts, averaged over all the processors, is our approximation
of the scheduling granularity. The trade-off is best
illustrated in the dense matrix multiply benchmark, which allocates
significant amounts of heap memory. Figure 15 shows
the resulting trade-off for this benchmark at the fine thread
granularity. As expected, both memory and scheduling granularity
increase with K, while running time reduces as K is
increased.
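
The granularity estimate described above reduces to a short computation over per-processor counters. In this sketch the function name is ours, and processors that never steal are skipped to avoid dividing by zero, which is an assumption the text does not address.

```python
def scheduling_granularity(local_schedules, steals):
    """Average, over processors, of (threads scheduled from the
    processor's own deque) / (steals performed by that processor).

    Processors with zero steals are skipped (an assumption); if no
    processor stole at all, the granularity is reported as infinite.
    """
    ratios = [local / stolen
              for local, stolen in zip(local_schedules, steals)
              if stolen > 0]
    if not ratios:
        return float("inf")
    return sum(ratios) / len(ratios)
```

For two processors with 100 and 300 local schedules and 10 steals each, the estimate is (10 + 30) / 2 = 20.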

6  Simulating  the  schedulers 

To compare algorithm DFDeques with a work-stealing scheduler,
we built a simple system that simulates the parallel execution
of synthetic, nested-parallel, divide-and-conquer benchmarks¹⁶.
Our implementation simulates the execution of the
space-efficient work-stealing scheduler [9] (labeled “WS”), the
space-efficient, asynchronous depth-first scheduler [34] (“ADF”),
and our new DFDeques scheduler (labeled “DFD”).

Due to limited space, we present results for only one of
the synthetic benchmarks here¹⁷, in which both the memory
requirement and the thread granularity decrease geometrically
down the recursion tree. A number of divide-and-conquer
programs exhibit such properties. Scheduling granularity was
measured as the average number of actions executed by a processor
between two steals. Figure 16 shows that work stealing
results in high scheduling granularity and high space requirement,
the depth-first scheduler results in low scheduling granularity
and low space requirement, while DFDeques allows
scheduling granularity to be traded with space requirement by
varying the memory threshold K.
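
A synthetic benchmark of this shape can be generated along the following lines. This is only a sketch of the idea, not the simulator described in [33]: the function name and parameters are ours, a binary recursion is assumed, and drawing each value uniformly from [0, 2·mean] is one assumed way to obtain the specified mean.

```python
import random


def synthetic_levels(levels, root_work, root_space, factor=2.0, seed=0):
    """Per-level work and space requirements for a binary
    divide-and-conquer tree whose mean requirements shrink
    geometrically by `factor` at each level of the recursion.

    Returns a list with one (work_list, space_list) pair per level;
    level d holds 2**d threads.
    """
    rng = random.Random(seed)
    result = []
    for depth in range(levels):
        mean_work = root_work / factor ** depth
        mean_space = root_space / factor ** depth
        threads = 2 ** depth
        # Uniform on [0, 2 * mean] has the requested mean (assumed range).
        work = [rng.uniform(0, 2 * mean_work) for _ in range(threads)]
        space = [rng.uniform(0, 2 * mean_space) for _ in range(threads)]
        result.append((work, space))
    return result
```

Because the thread count doubles while the mean halves, the expected total work per level stays constant when `factor` is 2.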

7  Summary  and  Discussion 

Depth-first schedulers are space-efficient, but unlike work-stealing
schedulers, they require the user to explicitly increase
the thread granularity beyond what is required to amortize
basic thread costs. In contrast, algorithm DFDeques automatically
increases the scheduling granularity by executing

16  To  model  irregular  applications,  the  space  and  time  requirements  of  a  thread 
at  each  level  of  the  recursion  are  selected  uniformly  at  random  with  the  specified 
mean. 

17  Results  for  other  benchmarks  and  a  detailed  description  of  the  simulator  can 
be  found  elsewhere  [33]. 
Figure 14: High water mark of heap memory allocation (in MB) on 8 processors for benchmarks involving dynamic memory
allocation (K = 50,000 bytes for “ADF” and “DFD”), at both thread granularities. “DFD-inf” is our approximation of work
stealing using DFDeques($\infty$).
Figure 15: Trade-off between running time, memory allocation and scheduling granularity using algorithm DFDeques as the
memory  threshold  K  is  varied,  for  the  dense  matrix  multiply  benchmark  at  fine  thread  granularity. 
Figure 16: Simulation results for a divide-and-conquer benchmark with 15 levels of recursion running on 64 processors. The memory
requirement and thread granularity decrease geometrically (by a factor of 2) down the recursion tree. Scheduling granularity is
shown as a percentage of the total work in the dag. “WS” is the space-efficient work-stealing scheduler, “ADF” is the space-efficient
depth-first scheduler, and “DFD” is our new DFDeques scheduler.
Figure 17: Speedups for the tree-building phase of Barnes-Hut
(for 1M particles). The phase involves extensive use of
locks on cells of the tree to ensure mutual exclusion. The
Pthreads-based schedulers (all except Cilk) support blocking
locks. “DFD” does not result in a large scheduling granularity
due to frequent suspension of the threads on locks; therefore,
its performance is similar to that of “ADF”. Cilk [20]
uses a pure work stealer and supports spin-waiting locks. For
this benchmark, the single-processor execution time on Cilk is
comparable with that on the Pthreads-based system.

neighboring, fine-grained threads on the same processor to
yield good locality and low scheduling contention. In theory,
for nested-parallel programs with a large amount of parallelism,
algorithm DFDeques has a lower space bound than
work-stealing schedulers. We showed that in practice, it requires
more memory than a depth-first scheduler, and less memory
than work stealing. DFDeques also allows the user to control
the trade-off between space requirement and running time
(or scheduling granularity). Because algorithm DFDeques allows
more deques than processors, it can be easily extended
to support blocking synchronizations. For example, preliminary
results on a benchmark that makes significant use of
locks indicate that DFDeques with blocking locks results in
better performance than a work stealer that uses spin-waiting
locks (see Figure 17).

Since Pthreads are not very lightweight, serializing access
to the set of ready threads R did not significantly affect the
performance in our implementation. However, serial access
to R can become a bottleneck if threads are extremely
fine-grained and require frequent suspension due to memory
allocation or synchronization. To support such threads, the
scheduling operations (such as updates to R) need to be parallelized [33].

Each processor in DFDeques treats its deque as a regular
stack. Therefore, in a system that supports very lightweight
threads, the algorithm should benefit from stack-based
optimizations such as lazy thread creation [21, 31]; these
methods avoid allocating resources for a thread unless it is stolen,
thereby making most thread creations nearly as cheap as
function calls.
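
The stack-like use of a deque can be illustrated with a minimal sketch in C. It is not the implementation evaluated in this paper: the names `deque_t`, `push_top`, `pop_top`, and `steal_bottom` are hypothetical, thread identifiers are plain integers, there is no synchronization, and taking stolen work from the end opposite the owner is assumed here as in typical work-stealing designs.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64  /* hypothetical fixed capacity for the sketch */

/* A deque of ready-thread ids. The owning processor works at the top
   (stack discipline); a thief is assumed to take from the bottom.
   Slots freed by steals are not reused in this simplified version. */
typedef struct {
    int items[DEQUE_CAP];
    size_t bottom;  /* index of the oldest entry */
    size_t top;     /* one past the newest entry */
} deque_t;

static void deque_init(deque_t *d) {
    assert(d != NULL);
    d->bottom = 0;
    d->top = 0;
}

static size_t deque_size(const deque_t *d) {
    return d->top - d->bottom;
}

/* Owner: push a newly ready thread, like a call-stack push.
   Returns 0 on success, -1 if the array is exhausted. */
static int push_top(deque_t *d, int tid) {
    if (d->top == DEQUE_CAP) return -1;
    d->items[d->top++] = tid;
    return 0;
}

/* Owner: pop the most recently pushed thread.
   Returns 0 on success, -1 if the deque is empty. */
static int pop_top(deque_t *d, int *tid) {
    if (d->top == d->bottom) return -1;
    *tid = d->items[--d->top];
    return 0;
}

/* Thief: remove the oldest thread from the other end.
   Returns 0 on success, -1 if the deque is empty. */
static int steal_bottom(deque_t *d, int *tid) {
    if (d->top == d->bottom) return -1;
    *tid = d->items[d->bottom++];
    return 0;
}
```

As long as no steal occurs, `push_top` and `pop_top` behave exactly like a stack, which is the property that stack-based optimizations such as lazy thread creation rely on.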

Increasing scheduling granularity typically serves to
enhance data locality on SMPs with limited-size,
hardware-coherent caches. However, on distributed memory machines
(or software-coherent clusters), executing threads where the
data permanently resides becomes important. A multi-level
scheduling strategy may allow the thread implementation to
scale to clusters of SMPs. For example, the DFDeques
algorithm could be deployed within a single SMP, while some
scheme based on data affinity is used across SMPs.

An open question is how to automatically find the
appropriate value of the memory threshold K, which may depend on
the benchmark and on the thread implementation. One
possible solution is for the user (or the runtime system) to set K
to an appropriate value after running the program for a range
of values of K on smaller input sizes. Alternatively, it may be
possible for the system to keep statistics to dynamically set K
to an appropriate value during the execution.
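
The first option, choosing K from trial runs on smaller inputs, might be sketched as follows. This is only an illustration under stated assumptions: the record type `trial_t`, the function `pick_threshold`, and the selection rule (least peak memory among runs whose time is within a tolerance `slack` of the fastest run) are hypothetical and are not taken from our implementation.

```c
#include <assert.h>
#include <stddef.h>

/* One trial run on a small input (all fields are hypothetical). */
typedef struct {
    size_t k_bytes;     /* memory threshold K used for the run */
    double seconds;     /* measured running time */
    size_t peak_bytes;  /* measured peak memory allocation */
} trial_t;

/* Assumed rule: among the trials whose running time is within a
   factor (1 + slack) of the fastest trial, return the index of the
   one with the smallest peak memory. */
static size_t pick_threshold(const trial_t *trials, size_t n, double slack) {
    assert(trials != NULL && n > 0 && slack >= 0.0);

    /* Find the fastest running time over all trials. */
    double best = trials[0].seconds;
    for (size_t i = 1; i < n; i++) {
        if (trials[i].seconds < best) best = trials[i].seconds;
    }

    /* Keep the qualifying trial with the least peak memory. */
    size_t choice = n;  /* n means "nothing chosen yet" */
    for (size_t i = 0; i < n; i++) {
        if (trials[i].seconds <= best * (1.0 + slack)) {
            if (choice == n ||
                trials[i].peak_bytes < trials[choice].peak_bytes) {
                choice = i;
            }
        }
    }
    return choice;  /* always < n: the fastest trial qualifies */
}
```

A larger `slack` favors lower space at the cost of running time; whether a value chosen on small inputs carries over to larger inputs is part of the open question.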